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We describe the WFCAM Science Archive (WSA), which is the primary point of 
access for users of data from the wide-field infrared camera WFCAM on the United 
Kingdom Infrared Telescope (UKIRT), especially science catalogue products from the 
UKIRT Infrared Deep Sky Survey (UKIDSS). We describe the database design with 
emphasis on those aspects of the system that enable users to fully exploit the survey 
datasets in a variety of different ways. We give details of the database-driven cu- 
ration applications that take data from the standard nightly pipeline-processed and 
calibrated files for the production of science-ready survey datasets. We describe the 
fundamentals of querying relational databases with a set of astronomy usage examples, 
and illustrate the results. 

Key words: astronomical databases - surveys: infrared - stars: general - galaxies: 
general - cosmology: observations 



1 INTRODUCTION 

The term 'science archive' is first seen in the astronomy lit- 
erature in Barrett (1993) which describes the High Energy 
Astrophysics Science Archive Research Centre (HEASARC). 
This system is much more than a simple repository of 
data - HEASARC provides an online resource to en- 
able scientific exploitation of high-energy astronomy mis- 
sions via provision of science data, software, analysis tools 
and descriptive information. For example, the data hold- 
ings in the HEASARC amount to many terabytes (TB; 
1 TB=10^'^ bytes) so wholesale download is impractical; 
recognising this, a server-side analysis facility (i.e. a facil- 
ity co-located with the data and hence remote to the typi- 
cal user) is provided to enable large-scale processing given 
an arbitrary astronomical usage scenario. In this way, data 
download is limited to user-defined subsets, sometimes pro- 
cessed in a manner specified by the user at access time. 

The advent of the large Schmidt photographic plate 



digitisation programmes (Hambly et al. 2001a and refer- 
ences therein) and infrared surveys such as DENIS (Epchtein 
et al. 1994) and 2MASS (Kleinmann et al. 1994) presented 
similar challenges for ground-based missions. Data distribu- 
tion for the digitised Schmidt surveys was originally done on 
removable, permanent storage media ('compact disc' read- 
only memory), but this became impractical so online ser- 
vices rapidly developed for these also. However, it is prob- 
ably fair to say that it was with the challenges posed by 
the Sloan Digital Sky Survey (SDSS; York et al. 2000) that 
the ground-based astronomical science archive became fully 
developed (Gray et al. 2002). The first SDSS (the so-called 
Sloan Legacy Survey) is now complete, and ~ 2 x 10^ sources 
have been measured and characterised, producing a cata- 
logue of several TB in size with associated imaging data 
and ~ 10® spectra amounting to a total volume of ~ 10 TB 
(Adelman-McCarthy et al. 2007 and references therein); the 
state-of-the-art SDSS science archive is described in Thakar 
et al. (2003a). 
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The challenges and opportunities presented by the cur- 
rent generation of ground-based infrared surveys were noted 
by Lawrence et al. (2002). In particular, they cited the ad- 
vent of a new wide-field camera for the 4-m United King- 
dom Infrared Telescope (WFCAM for UKIRT; Casali et 
al. 2007) and the even greater challenges posed by the new 
dedicated 4-m telescope for infrared surveys VISTA (Emer- 
son 2001). These ambitious survey missions gave rise to a 
systems-engineered data management project, the VISTA 
Data Flow System (VDFS; Emerson et al. 2004) which in- 
cluded provision of pipeline processing and science archiving 
for WFCAM and VISTA data. Here, we concentrate on the 
first VDFS science archive known as the WFCAM Science 
Archive (WSA). From the outset, the design of the WSA has 
been science-driven with the main science stake-holders be- 
ing users of the UKIRT Infrared Deep Sky Survey (UKIDSS; 
e.g. Warren 2002). 

This paper is one of a set of five which provide the refer- 
ence technical documentation for UKIDSS, although it is of 
direct relevance to any user of the WFCAM Science Archive. 
The other four papers in the series describe the infrared sur- 
vey instrument itself (WFCAM; Casali et al. 2007); the WF- 
CAM photometric system (Hewett et al. 2006); the UKIDSS 
surveys (Lawrence et al. 2007); and the pipeline processing 
system (Irwin et al. 2007). 

This paper is arranged as follows. In Section [2] we de- 
scribe the design of the WSA, concentrating on the develop- 
ment of the science requirements into data models (i.e. the 
database design) as presented to the end-user at access time. 
In Section [3] we discuss various detailed implementation is- 
sues that in particular inform the user as to how science- 
ready survey catalogues are generated from the standard 
flat file products processed by the nightly pipeline. Section|4] 
then goes on to discuss some illustrative science examples by 
concentrating on the expression of certain specific science 
usage modes in Structured Query Language (SQL), the lin- 
gua franca of relational database users. Following the usual 
conclusion, acknowledgements and bibliography we present 
as appendices some supplementary information to aid first- 
time users of the WSA. 



2 DESIGN 

In this paper, we concentrate on those aspects of the de- 
sign that are relevant to the end user, assumed to be an 
astronomer interested in exploiting the archive for the pur- 
poses of scientific research. Further background informa- 
tion, and in particular technical details of the Information 
Technology aspects of the overall VISTA Data Flow System 
can be obtained from a set of papers appearing in recent 
volumes of the Astronomical Data Analysis and Software 
Systems (ADASS) and the International Society for Opti- 
cal Engineering (SPIE) publications series - see Hambly et 
al. (2004a), Collins et al. (2006), Emerson et al. (2006) and 
Cross et al. (2007). The design of the WSA is based, in part, 
on that of the science archive system for the SDSS (Thakar 
et al. 2003a and references therein). In particular, we have 
made extensive use of the relational design philosophy of 
the SDSS science archive, and have implemented some of 
the associated software modules (e.g. that for the compu- 
tation and use of Hierarchical Triangular Mesh indexing of 



spherical coordinates - see Kunszt et al. 2000). Scalability 
of the design to terabyte data volumes was prototyped using 
our own existing legacy Schmidt survey dataset, the Super- 
COSMOS Sky Survey (Hambly et al. 2001a and references 
therein). The resulting prototype science archive system, the 
SuperCOSMOS Science Archive (SSA) is described in Ham- 
bly et al. (2004b) and provides an illustration of the contrast 
in end-user experience of an old-style survey interface (as 
described in Hambly et al. 2001) and the new. Note that 
extensive technical design documentation for the WSA is 
maintained onlinqj. 

The following Sections provide more information on the 
design to a level of detail that will enable a general user 
of the WSA to understand and to get the highest possible 
return out of the system. 



2.1 Background 

The WSA is a system designed to store, curate and serve 
all observations made by WFCAM, which is described in 
detail in Casali et al. (2007). The infrared active part of the 
focal plane consists of four 2048 x 2048 detectors with plate 
scale 0.4" pix^^ arranged in a square pattern and spaced 
by 94% of the detector width (e.g. Casah et al. 2007, Fig- 
ure 2). Hence, a sequence of four pointings is required to 
produce contiguous areal coverage of 0.78 sq. deg. (this is 
sometimes called a tile); however the unit of WSA cura- 
tion (e.g. frame association for source merging - see later) 
is based around images of the size of one detector (known 
as a detector frame) . Such an image is usually the result of 
stacking of a set of dithered and/or microstepped individual 
exposures (known as normal frames in the VDFS). Dither- 
ing (also known as jittering) is typically executed in step 
patterns of several arcseconds about a base position to al- 
low for the removal of poor quality pixels at the processing 
stage. Microstepping, on the other hand, is sometimes used 
to recover full PSF sampling as the image quality delivered 
by WFCAM/UKIRT often can be better than the Nyquist 
limit of ~ 0.8" given the 0.4" WFCAM pixels. WFCAM in- 
strument performance is concisely summarised in Casah et 
al. (2007), Table 3: e.g. median (best) image quality is 0.7" 
(0.55") at zenith at K band. 

Observing time with WFCAM on UKIRT is divided 
between large scale surveys (i.e. UKIDSS and the recently 
instigated 'campaigns'), smaller Pl-led projects (awarded 
time via a telescope time allocation group), 'service' mode 
observations for very small projects requiring only a few 
hours of time, and special projects like observatory /survey 
infrastructure (calibration) and director's discretionary time 
projects. Data from all these is tracked in the WSA, but 
the design is dictated primarily by the largest surveys, 
i.e. UKIDSS, which is described in detail in Lawrence et 
al. (2007). 

Briefiy, UKIDSS consists of a hierarchy of five sur- 
veys that trade depth versus area to cover a multitude 
of science goals. The Large Area Survey (LAS) aims to 
cover ~ 4000 sq. deg. in four infrared passbands to depths 
Y ~ 20.3, J ~ 19.8, H ~ 18.6 and K ~ 18.2 with two epochs 
of coverage at J. The Galactic Plane Survey (GPS) aims 
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to cover ~ 1900 sq. deg. to depths J ~ 19.9, H ~ 19.0 
and K ~ 19.0 with two (originally three) epochs of coverage 
at K and some coverage at narrow-band H2. The Galactic 
Clusters Survey (GCS) will survey ten open-cluster/star- 
formation regions to a total of ~ 1000 sq. deg. to depths 
Z ~ 20.4, Y ~ 20.3, J ~ 19.5, H ~ 18.6 and K ~ 18.6 
with two epochs of coverage at K. The Deep eXtragalac- 
tic Survey (DXS) aims to survey four selected areas to a 
total of ~ 35 sq. deg. to depths J ~ 22.3, H ~ 21.8 and 
K ~ 20.8. Finally, the Ultra Deep Survey (UDS) aims to sur- 
vey ~ 0.8 sq. deg. to depths J ~ 24.8, H ~ 23.8 and K ~ 22.8. 
UKIDSS LAS (J), GPS (JHK), GCS (K) and DXS (JK) em- 
ploy 2x2 microstepping (yielding 0.2" samples) while the 
UDS employs 3x3 microstepping in all filters (yielding 0.13" 
samples). In the VDFS, an image resulting from interleav- 
ing microstepped frames is known as a leav frame while an 
image resulting from stacking a set of dithered exposures is 
known as a stack. An interleaved, stacked image is known as 
a leavstack frame - for many more details of VDFS pipeline 
processing, see Irwin et al. (2007). Survey data quality ob- 
tained in practice is summarised in UKIDSS data release 
papers (e.g. Dye et al. 2006; Warren et al. 2007a). Median 
seeing at Data Release 1 was ~ 0.83"; uniformity of photo- 
metric calibration as estimated via field-to-field scatter was 
between 0.02 and 0.03 mag in Y-J, J-H and H-K; mean stel- 
lar ellipticity was ~ 0.08. Observing strategies for UKIDSS 
are discussed extensively in Lawrence et al. (2007). Tiling 
the wide, shallow surveys, especially at high Galactic lati- 
tude, is dictated largely by the availability of suitable guide 
stars (V < 17; Casali et al. 2007). This results in varying 
degrees of frame overlap and non-uniform tiling. The WSA 
copes with this via a data-driven source merging philoso- 
phy, and a flexible seaming algorithm for the production 
of interim catalogue products during the 7 year UKIDSS 
observing campaign, as is required to maximise timely sci- 
entific exploitation. Furthermore, a requirement exists for 
associating multiple-epoch visits of the same field, in addi- 
tion to merely associating different passband visits. Again, a 
database-driven application ensures that sensible frame as- 
sociations are made in the presence of incomplete datasets 
when intermediate releases are required before full survey 
completion. 

In WSA parlance, UKIDSS as a whole is referred to 
as a survey while the LAS, GPS, DXS etc. are known as 
programmes (the rest, including Pl-led programmes, are 
known as non-survey programmes). For the purposes of 
book-keeping at the observatory, observing is broken up 
into chunks known as projects which have a unique name 
that may include a Semester identiflcation (e.g. u/07a/32 
for non-survey Pl-led programme no. 32 in Semester 07A; 
u/ukidss/gcs5 for UKIDSS GCS project observing set no. 5). 
The various survey and non-survey processed datasets 
stored and served in the WSA have proprietary periods 
ranging from 12 months for non-survey programmes to 18 
months for the larger campaigns and surveys. These periods 
run from the time at which the processed data are made 
available to the respective proprietors rather than individual 
frame observing dates. Note that UKIDSS is proprietary to 
astronomers in the European Southern Observatory member 
states, while campaign and non-survey datasets are propri- 
etary to the respective Pis and their named collaborators. 



2.2 Archive requirements 

A set of top-level general requirements was established early 
in the history of the WSA projeco Briefly, requirements 
were specifled in the following broad categories: i) top- 
level; ii) general contents and functions; iii) detailed func- 
tional requirements; and iv) security. Examples include i) 
broad-brush requirements concerning flexibility, scalability, 
ease-of-use and scope, e.g. the WSA is required to hold all 
pipeline-processed WFCAM data, not just that belonging 
to survey programmes (UKIDSS); ii) minimal requirements 
concerning contents and functionality, e.g. contents to in- 
clude pixel, catalogue and associated metadata, along with 
calibration data; iii) a set of detailed functional requirements 
from the point of view of the end-user, e.g. searching and vi- 
sualisation functionality required in the user interface; and 
finally iv) security rules concerning protection of the data 
itself, its integrity and any proprietary rights thereof. 

In order to progress the design of the WSA from the 
top-level generalities summarised above, we followed a ra- 
tional process similar to that employed in the design of the 
science archive for the SDSS (Thakar et al. 2003a), viz. the 
development of a set of questions and usage modes that one 
would ask or require of the archive to fulfill the functional re- 
quirements previously identified. This may seem somewhat 
ad hoc compared with a standard, 'unified rational process' 
(such as is encapsulated in Unified Modelling Language de- 
sign, e.g. Gaessler et al. 2004 and references therein) but it 
has been successfully employed in the past (not least in the 
SDSS science archive design) and is rather powerful, despite 
its relative simplicity. We developed a set of 20 curation 
usage modes for the WSA and a set of 20 end-user usage 
modes in collaboration with the UKIDSS user community 
(see Appendix IX}. These were then analysed along with the 
original top-level requirements to produce a requirements 
analysis document to inform the detailed design described 
below. The design documents are all available onlinqj. 



2.3 Design fundamentals 

The WSA receives processed data from the pipeline com- 
ponent of the overall data fiow system in the form of 
FITS (Hanisch et al. 2001) image and catalogue binary ta- 
ble files (Irwin et al. 2007). No raw pixel data are held 
in the WSA. Processed data consists of instrumentally- 
corrected WFCAM frames, associated descriptive and cal- 
ibration data (including confidence frames and calibration 
images, e.g. darks and flats) and single-passband detection 
lists derived from the science frames. Calibration informa- 
tion also includes astrometric and photometric coefficients 
derived using the 2MASS (Skrutskie et al. 2006) point source 
catalogue as a reference. Metadata, meaning in this con- 
text those data that describe the imaging observations and 
processing thereof (e.g. observing dates/times, filters, in- 
strument state, weather conditions, processing steps, etc.), 
are defined by a set of descriptor keywords agreed between 
the archive and pipeline centres, and include all informa- 
tion propagated from the instrument and observatory, along 

^ http : //www . j ach . hawaii . edu /UKIRT /management /wds/] 

requirements/wf arcrq.html 
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with additional keywords that describe the processing ap- 
pUed to each image and catalogue in the pipeline. Single 
passband detection catalogues for each science image have 
a standard set of 80 photometric, astrometric and morpho- 
logical attributes along with error estimates, and a variety 
of summary quality control measures (e.g. seeing, average 
point source ellipticity etc.). For many more details, see Ir- 
win et al. (2007). 

The design of the WSA was based, from the outset, 
on a classical client-server architecture employing a third- 
party back-end commercial database management system 
(DBMS). This followed similar but earlier developments for 
the SDSS science archive, and reflects the great flexibility of 
such a system from the point of view of both applications 
development and end-user querying. Furthermore, although 
originally built (Szalay et al. 2000) on an 'object oriented' 
database|j issues with performance and ease of use by the 
end user led to a switch to a relational database manage- 
ment system (RDBMS; a system that presents data as a 
group of related tabular data sets) in that project (Thakar 
et al. 2003b) and the WSA has been based on the rela- 
tional model from the start. This brings many advantages 
for astronomy applications (indeed, for applications in any 
scientiflc discipline) where related sets of tabular informa- 
tion are familiar. Such advantages will be illustrated below; 
at this point we emphasise a few fundamental aspects of the 
relational design. 



2.3.1 Default values and 'not null' 

As always in database design, a decision has to be made as to 
how to deal with the situation when no measurement is avail- 
able to populate a particular field of a given row. For exam- 
ple, it may be that the data model (see later) requires that a 
merged source table has columns for infrared colours (J-H). 
What happens when H, or J, or even both are unavailable 
for that particular source (perhaps the images in these filters 
have not been taken yet, but we require to allow users access 
to the data that do exist for this source - e.g. observations 
in other filters)? This particular attribute, (J-H), could be 
set to a specified default value (an appropriately out-of- 
range but nonetheless real number, say —0.999999 x 10^) or 
it can be allowed to be undefined ('null') in the RDBMS. 
One of the (many) problems with null values is that they 
complicate querying of the database: it is easier and clearer 
to ask "give me all the objects with (J-H) in the range 0.5 
to 1.0" than it is to ask the same question with the addi- 
tional predicate "and (J-H) is not null" (necessary because 
the RDBMS returns null values in results sets as a standard 
data type to be handled by querying applications). By judi- 
cious choice of default values, we can force exclusion of those 
rows where no measurement is available in an explicit and 
clear manner (in this case because the default value is out- 
with the range of a typical colour selection) thus simplifying 
querying applications. In this simple example this may seem 
rather unimportant but in more complicated situations the 



* a system that presents database objects (tables, rows, columns, 
constraints etc.) as programming language 'objects' (i.e. entities 
encapsulating both data and programming functionality) to client 
applications 



Table 1. Default values for various data types in the WSA 
database. 



Default value 



Data type 



—0.9999995 x 10® Floating point (single/double precision) 

-99999999 Integer (4- and 8-byte) 

-9999 Integer (2-byte) 

NONE Character 

9999-Dec-31 Date-times 



use of default values can greatly simplify querying applica- 
tions and, as we describe later, the WSA philosophy is to 
expose the full power of the RDBMS to the end user for 
complete flexibility in querying. The WSA employs the de- 
fault values as specified in Table[T]for the various data types 
listed, and does not allow null attributes in any column of 
any table. 



2.3.2 Physical units 

Physical quantities in the WSA are stored in SI units wher- 
ever possible. Astronomical convention dictates the usual 
standards for many astrophysical quantities; a conventional 
magnitude scale on the natural WFCAM system (Hewett 
et al. 2006) is employed for calibrated fluxes. All times- 
tamps employed in the data flow system, including the 
science archive, are 'Universal Time Coordinate' (UTC) 
date/times. Spherical coordinates are stored in equatorial 
(J2000.0 equinox). Galactic and SDSS (A,r;) coordinates 
(Stoughton et al. 2002) for ease of querying in different sys- 
tems, and all angles (RA, Dec etc.) are stored in units of 
decimal degrees apart from a small number of image at- 
tributes that map directly to FITS keywords delivered by 
the pipeline. Equatorial coordinates at equinox J2000.0 are 
labelled with a 20-level Hierarchical Triangular Mesh index 
(Kunszt et al. 2000) to make spatially limited queries effi- 
cient. 



2.3.3 Miscellaneous fundamentals 

Pixel data are stored as fiat files in the WSA system, rather 
than as 'binary large objects' in DBMS tables. This is so that 
high data volume usages (i.e. those requiring access to pixel 
data) that are not time-critical will not impact catalogue 
querying, where more 'real-time' performance is required for 
data exploration and interaction. However, pixel file names 
and the pixel metadata are tracked in tables within the 
DBMS so that the image descriptors can be browsed and 
queried in the same way as, or in conjunction with, cata- 
logue data. 

The WSA is organised as a self-describing database. 
This means that curation information, i.e. information per- 
taining to database-driven activities (for both invocation 
and results logging) in preparation of science-ready data 
products (see later) is contained in the database, along with 
science data. For example, the requirements for source merg- 
ing for a survey programme (the filter selection and the num- 
ber of passes in each filter, the source pairing criterion, etc.) 
are stored in database tables to drive the relevant curation 
activity and to inform users of the procedure. 
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2.4 The WSA relational model 

A good design for a relational database captures the struc- 
ture inherent in the data to be stored, thereby aiding cura- 
tion operations and end-user query modes, as these are both 
likely to reflect that structure. In conventional relational de- 
sign, this structure is captured in an entity-relational model 
(ERM), in which a collection of related data is represented 
by an entity and entities have relationships between them, 
which can be mandatory or optional and can have one of 
three cardinalities (one-to-one, one-to-many or many-to- 
many). 

To illustrate this, consider a processed WFCAM im- 
age file, as delivered by the pipeline (Irwin et al. 2007). 
Such a multi-extension FITS (MEF) file consisting of a pri- 
mary header-data unit with generic descriptive keywords 
(observation date/time, filename, telescope/instrument pa- 
rameters etc.) and a set of extensions containing the images 
and corresponding descriptive data of individual detectors 
can be represented in relational terms as in Figure [l] Here, 
we identify entities Multiframe and MultiframeDetector 
and a one-to-many relationship between them, each entity 
containing attributes that describe it. A particularly impor- 
tant point to note here is that the arrangement of data as 
represented in Figure [T] is normalised in the sense that we 
do not duplicate attributes in entity MultiframeDetector 
that pertain to a set of individual extension frames in each 
Multiframe - e.g. we could represent the data using a sin- 
gle entity where each set of detector frames (four in the 
case of WFCAM MEF file of a typical observation) is de- 
scribed by the generic attributes in entity Multiframe in ad- 
dition to the specific attributes pertaining to each. Clearly, 
in terms of storage it is more efficient to have one record 
of the generic attributes of each set of detector frames, and 
link each MultiframeDetector to its parent Multiframe us- 
ing a label and a reference in the RDBMS. Note that there 
is no requirement here for every Multiframe to have ex- 
actly four detector frames. A mosaiced image product can 
be equally well described by this data model - there will 
simply be a single extension representing the whole image, 
and the mosaic Multiframe will simply have one related row 
in MultiframeDetector. 

In designing the WSA relational model, normalisation 
has been used except in a small number of cases where it 
makes sense to denormalise and duplicate some attributes 
for ease of use and better performance at query time; this 
is illustrated later, along with example usage modes requir- 
ing to query a set of normalised tables ('join' queries). The 
principle of normalisation complicates the data model for 
the novice user, but it is extremely important when design- 
ing a system that must scale to very large data volumes. 

Experience with the SDSS has shown that scientifically 
realistic queries often require inclusion of constraints on 
metadata parameters and selection of rows on the basis of 
their provenance (e.g. properties of their parent images). To 
do this requires the user to know the basic structure of the 
database, so, in the remainder of this Section we describe 
the principal contents of the WSA in terms of ERMs, at a 
level which will enable users to define the queries they need 
to run to do their science. 



MultiframeDetector 




Multiframe 


* image identifier 




* FITS file name 


(e.g. extension number) 




* date of observation 


* image size 
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' exposure time 
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* further attributes ... 




' further generic attributes ... 


. 







Figure 1. A simple 'entity-relationship' model (ERM) showing 
in schematic form the relationship between the generic attributes 
of a multi-extension FITS image (a Multiframe in WSA par- 
lance) and the particular attributes of each constituent image 
(MultiframeDetector) of that multiple image container file. The 
onc-to-many (in fact onc-to-four in the case of WFCAM) rela- 
tionship between these two entities is represented by the 'crows 
foot' connecting the boxes (see text for a more detailed explana- 
tion). 



2.4-1 Image data 

As noted previously, image metadata are tracked in the 
WSA database, although the image pixel data themselves 
are not ingested into the RDBMS - they are stored as 
flat files on disk. In Figure [5] we show the ERM for pixel 
data in the WSA. Each entity box represents a database ta- 
ble, and one-to-many relationships between the tables are 
shown, as before. Note that some relationships are manda- 
tory whereas some are optional. An example of a manda- 
tory relationship is that every Multiframe has one or more 
MultiframeDetectors (not unreasonably, since a MEF de- 
void of any detector frames is not particularly useful). An 
example of an optional relationship, denoted by a dashed 
line on the side where the relationship is optional, is that 
every Filter might have one or more Multiframes (again, 
not unreasonable since there may be urmsed filters present 
in WFCAM) and yet every Multiframe has to have one as- 
sociated filter record only. In this case, the mandatory rela- 
tionship implies that there must always be a link between 
the Multiframe and Filter tables, even if that link points 
to a blank filter record, or if the filter keyword in a given 
Multiframe was unavailable for some reason, then the link 
will take a default value (see previously). However, to main- 
tain referential integrity in the database there will need to 
be a default row in table Filter that can be referenced by 
the default link. This situation can occur in any part of the 
WSA data model where a mandatory relationship exists be- 
tween two tables. 

Another useful feature of these ERM schematics is the 
indication of a unique identifier using the '#' sign in the at- 
tribute list (a convention in entity-relationship modelling). 
Unique identifiers (UIDs) are, of course, key to efficient op- 
eration in a DBMS - without them, a table is simply a heap 
of data in which a specific row cannot be found easily. With 
a UID, on the other hand, every row of a table is uniquely 
labelled and can be located quickly, especially if the table 
data are sorted on that attribute (as is generally the case). 
Note that barred relationships in Figure[2]indicate where the 
combination of UIDs in both tables linked by the relation- 
ship are used, in the table on the barred side, as a combined 
UID. In the case of Multiframe and MultiframeDetector, 
for example, the UID in the former is simply a running 
number assigned on ingest, while in the latter the UID is 



6 Hamhly et al. 



MultiframeDetector 

# muttiframe ID no. 

# extension no. 
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\/ 



Dark, flat, defringe, sky, 
Confidence and object 
mask multiframes 



Combiframd 



Component frame 



7\. 



Provenance 

# combiframe ID 

# multiframe ID 



Figure 2. Relational model for image data in the WSA. Each box represents a tabic in the database; the lists of attributes in each are 
for illustration only, and are not intended to be complete. The 'crows feet' illustrate one— to-many relationships between data entries in 
each entity; dotted lines indicate optional as opposed to mandatory relationships (see text for further details). 



a combination of the parent Multiframe UID plus the ex- 
tension number - in this way, every MultiframeDetector is 
uniquely identified (it is conventional in ERMs to omit as 
'#' UID attributes those UIDs from a related table, but we 
have explicitly noted them for clarity). 

Other types of relationship are shown in Figure O and 
they illustrate how the WSA tracks the processing history, 
or provenance, of each processed image. Entity Provenance 
tracks the ancestor images of any image in the WSA that 
is the result of a combinatorial process on other images also 
tracked in the archive; hence for a Multiframe composed 
of A'' other Multiframes (e.g. a stack of individual dithered 
Multiframes) this would contain N records, each consist- 
ing of the UID of the final stack product (the attribute la- 
belled as combif ramelD) along with one of each of the N 
separate constituent Multiframe UIDs; the other optional 
one-to-many relationship between entities Provenance and 
Multiframe indicates that every component frame recorded 
in the former must be present in the latter, while every 
Multiframe may be included as a constituent frame in one or 
more combined frame products. Finally, there is an optional 
self-referencing relationship indicated in the lower right- 
hand corner of entity Multiframe. This indicates that each 
Multiframe may be a pixel value correction frame used in 
the processing of one or more science Multiframes (there 
is an attribute to distinguish between different Multiframe 
types); conversely, each Multiframe may have been pro- 
cessed using one or more of each of the correction frame 
types dark, flat, sky etc. The relationship is optional on both 
sides since, for example, a flat will not itself be calibrated 
against a flat; moreover every single calibration frame that 
is propagated through the system may not get used in the 
processing of any science frames. 

2.4- S General catalogue data model 

In Figure [3] we show a generalised ERM for catalogue data 
in the WSA. A set of five entities are identified that link 



with each other and with entity MultiframeDetector (see 
Figure [Sjl as shown. Briefiy, standard 80-parameter detec- 
tion lists from science images delivered by the pipeline (Ir- 
win et al. 2007) are tracked in entity Detection; hence 
every MultiframeDetector may give rise to one or more 
Detections with a UID that includes the UID of the former. 
End-user science requirements, however, specify that most 
science applications need a merged, multi-colour, multi- 
epoch source list for convenience, so this data model in- 
cludes an entity Source to track merged source records pro- 
duced by a standard curation procedure (see later). Each 
Source is always made up of one or more individual pass- 
band Detections. The source merging procedure operates 
on sets of MultiframeDetectors where a frame set com- 
prises detector frames taken at the same position but in 
different filters and/or at different times. These frame sets 
are tracked by entity MergeLog where every MergeLog frame 
set always consists of one or more MultiframeDetectors 
while an individual frame in the latter may or may not 
be a member of a frame set - non-science frames would 
not be included in frame sets, for example. The final two 
entities in Figure |3] are included to track enhanced cat- 
alogue extraction data from a process known colloquially 
as list-driven remeasurement. Standard pipeline processing 
treats each science image separately and extracts sources us- 
ing a set of standard apertures and adaptive profile models 
applied at positions having detections above a sky noise- 
dependent threshold as described in Irwin et al. (2007). In 
the list-driven remeasurement scenario, a frame set is re- 
analysed for photometric attributes amongst all individual 
frames in the set using a master list of sources that are 
present in the field and a single set of apertures and mod- 
els to yield photometric attributes consistently measured 
across the frame set. In this way, attributes such as colours 
are measured in a usefully consistent way, e.g. at the same 
position and with the same profile model, across all avail- 
able passbands. In many ways, entities ListRemeasurement 
and SourceRemeasurement are analogous to Detection and 
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Source respectively and hence show similar relationships be- 
tween each other and Multif rameDetector. However, every 
SourceRemeasurement is driven by one Source - this defines 
the one-to-one relationship between these entities. Futher- 
more, certain photometric attributes of the remeasurement 
entities will have slightly different meanings to their ana- 
logues in Detection and Source, most notably flux mea- 
surements at positions defined by the driving list. In or- 
der to cope with the possibility of marginally detected or 
negative fiuxes, one approach (which has yet to gain wide 
acceptance in the astronomical literature) is to adopt the 
magnitude scale of Lupton, Gunn & Szalay (1999) in the 
remeasurement entities for any calibrated fiux attributes to 
be usefully defined in such a situation. (We note that at 
the time of writing, list-driven photometry has yet to be 
implemented in the WSA). 

It is important to note that the WSA is required to 
track a number of different science programmes in which 
the prescription for source merging (i.e. the required filters 
and number of distinct epoch passes in those filters) will be 
different. Before illustrating a specific example of the appli- 
cation of the generalised catalogue ERM, we need to discuss 
the top-level data model of the WSA that describes the ob- 
servational programmes contained within it. 

2.4-3 Top-level metadata 

In order to track the various programmes for which the WSA 
is required to hold data, e.g. survey (UKIDSS), non-survey 
(private proprietary) and 'service' programmes, the set of 
entities in the schematics in Figures 2] and [S] have been iden- 
tified. Consider the UKIDSS survey, which consists of five 
sub-survey components. Once again, in simple relational 
terms we identify entity Survey with a mandatory one-to- 
many relationship to a set of Programmes, e.g. the UKIDSS 
LAS, GPS, GCS etc., with each Programme consisting of one 
or more Multiframes. Note, however, in this case the rela- 
tionships between Survey and Programme, and Programme 
and Multiframe, are propagated via two further entities, 
SurveyProgrammes and ProgrammeFrame where the latter 
have optional or mandatory many-to-one relationships with 
their linked entities. In the case of entity Programme, the gen- 
eralisation in its relationship to Multiframe allows each im- 
age datasot in the latter to belong to none, one, or more than 
one Programme. This is useful, for example, in the UKIDSS 
GPS and GCS Programmes which overlap in their surveyed 
areas, filter coverage and depth, and it is clearly advanta- 
geous to use the same data for both rather than duplicate 
survey observations. Note also that entity RequiredFilter 
in Figure |4] specifies the prescription for source merging for 
a given Programme, where every Programme may have one or 
more RequiredFilters specified. For example, the UKIDSS 
LAS requires filter combination YJHK with two passes at 
J, whereas certain non-survey Programmes may not require 
source merging at all. Every RequiredFilter must of course 
reference an existing Filter, hence the mandatory many- 
to-one relationship between those two entities. Finally, en- 
tity Release tracks information about releases that have 
occured for a given survey; every Survey m,ay have one or 
more releases. 

Figure [5] shows the other main aspects of the WSA 
top-level data model with relevance to the end-user. The 



WSA holds local copies of external datasets, from vari- 
ous sources, as specified by the UKIDSS consortium early 
in the requirements capture phase of the project. These 
large datasets were anticipated as being essential to cer- 
tain science applications of the infrared surveys, and in- 
clude the Sloan Digital Sky Survey catalogue data releases, 
e.g. Data Releases 2, 3 and 5 (Abazajian et al. 2004; 
Abazajian et al. 2005; Adelman-McCarthy et al. 2007); 
the 2MASS point and extended source catalogues (Skrut- 
skie et al. 2006); and the SuperCOSMOS Science Archive 
database (e.g. Hambly et al. 2004b). The data model in Fig- 
ure [S] illustrates that every ExternalSurvey consists of one 
or more ExternalSurveyTables (e.g. 2MASS contains dis- 
tinct point and extended source tables) and every Programme 
has one or more ProgrammeTables that are required to be 
joined in pairs as specified in RequiredNeighbours (the 
joining philosophy and procedure is discussed further in 
Section 12.4.61 and in detail in Section I3.4.4|l . For example, 
the science requirements for the UKIDSS LAS specify that 
the LAS merged source list should be joined to the corre- 
sponding list in the SDSS. The generalisation using entities 
ProgrammeTable and ExternalSurveyTable allow for arbi- 
trary joins between any tables in the linked surveys rather 
than linking Programme and ExternalSurvey directly which 
would result in only one join being allowed for each pair of 
Surveys. 



2.4-4 Exam,ple data model for programm,e catalogue data 

The previous Section illustrates the hierarchy of Surveys, 
Programmes and their associated descriptive data model. It 
should be clear now that a distinct entity for each of Source, 
MergeLog and SourceRemeasurement (Figure ^ is implied 
for every Programme, since the prescription for source merg- 
ing in RequiredFilter will be different in each case and 
the attribute sets in these three merged source entities will 
be different (imposing the same attribute set on all merged 
source entities would necessitate a large number of defaults, 
i.e. unused attributes, for most). In fact each UKIDSS sur- 
vey Programme tracked in the WSA has the set of five enti- 
ties shown in Figure |3] for the purposes of storing catalogue 
data. This is because the single passband entities Detection 
and ListRemeasurement are closely related to their respec- 
tive merged source entities within a given programme, and 
because it can aid performance and housekeeping if large 
datasets are split into related subsets (in addition to clar- 
ifying the data model for the end user). In Figure [6] we 
give a specific example of the catalogue data model for the 
UKIDSS LAS. (Note that non-survey Programmes do not 
include remeasurement and merged source entities unless 
these are requested by their Pis). In addition to the gen- 
eral description already given in Section 12.4.21 it is worth 
noting at this point that we denormalise lasDetection and 
lasSource in that a small subset of the most useful single 
passband photometric attributes are copied from the former 
into the latter to facilitate simple end-user querying of what 
are anticipated to be the main science tables for the survey 
datasets, in this case the merged source table lasSource. For 
more details concerning source merging, see Section [3.4.21 
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MultiframeDetector 

- see image ERM 
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MergeLog 

- keeps a record of the images that have been 
included so far in the merged Source entity 
in terms of frame sets 



^ 



Li stRe measurement 

- generated by placing apertures at 
externally specified positions on the 
image 

- instrumental quantities and 
calibrated using current coeffs 



\/ 



SourceRemeasurement 

- new merged source list based on merged 
detection list; all passband attributes 
remeasured using one prescription 

- calibrated ptiotometry using asinh 
mag scale 



Detection 

- standard processing produces a list 
of detections on each image 
- instrumental quantities and 
calibrated using current coeffs 



Source 

- merging procedure merges 
detections into a single 
multi-epocfi, multi-colour 
record; not all passbands will 
fiave detections available 

- entianced astrometric attributes 

- calibrated quantities (astrom and 
pfiotom) plus subset of detection 
attributes concerning morphology 
etc ("most useful" attributes) 

- individual passband mags; colours 

- primary/secondary "seamless 
catalogue" flag attribute 



Figure 3. Generalised relational model for catalogue data in the WSA (see text). 



2.4-5 Calibration data model 

Pipeline processing delivers instrumental astrometric and 
photometric attributes and calibration coefficients (Irwin 
et al. 2007). For example, each single passband detection 
comes with an (x,y) coordinate location, and a set of FITS 
World Coordinate System (WCS; Calabretta & Griesen 
2002) comes with each image for transformation to celes- 
tial coordinates. Photometric attributes are also supplied as 
instrumental fluxes along with a set of calibration coeffi- 
cients for each image (zeropoints, aperture corrections, etc.) 
to be applied to put the photometric quantities on a stan- 
dard magnitude scale. The WSA stores all this information, 
and stores calibrated quantities according to the the current 
calibration in further attributes for ease of use. Hence, en- 
tity Detection (Figure[2| contains (x,y) and flux attributes 
along with (RA, Dec, I, b, A, rj^j celestial coordinates and 
a calibrated magnitude for every flux (and flux error) at- 
tribute. 

The advantage of storing instrumental quantities and 
calibration coefficients is that updates to the calibration 
can be tracked - e.g. at some point in the future, when 
a greater understanding of the WFCAM instrumental be- 
haviour has been gained and a much larger amount of data 
is available, it may be possible to recalibrate astrometry and 
photometry. Moreover, for photometry in particular, addi- 
tional calibration constraints (e.g. over many nights, or em- 
ploying overlap regions between adjacent frames) are avail- 
able within the WSA that are not easily implemented in 
nightly pipeline processing. In Figure [7] we show the rela- 



^ (A, -q) are spherical polar survey co-ordinates defined for the 
SDSS 



tional model for astrometric calibration data to illustrate the 
approach (photometric coefficient attributes are contained 
within the entities Multiframe and MultiframeDetector al- 
ready identified in Figure [2]). Astrometric calibration coeffi- 
cients are stored in entity CurrentAstrometry which has 
a one-to-one relationship with MultiframeDetector, op- 
tional on the side of the latter. These coefficients, and some 
attributes calibrated using them, are gathered together in 
this entity to make recalibration more efficient; the optional 
relationship with Multiframe reflects the fact that not all 
frames are necessarily astrometric (e.g. darks). The other 
two entities are included to track recalibration (if/when 
that occurs): each MultiframeDetector may have one or 
more PreviousAstrometry calibrations; and each of the lat- 
ter must be identified with an AstrometryVersion. These 
last two entities are unlikely to be of use to the end user 
but are included to illustrate the recalibration aspect of the 
WSA functionality. Similarly, instrumental photometric cal- 
ibration attributes are unlikely to be used in most end-user 
usage modes. 



2.4.6 Data m,odel for neighbouring sources from catalogue 
joins 



As already indicated in FigureOand Section[2A3l the WSA 
is required to hold local copies of large survey datasets pro- 
duced elsewhere to facilitate cross-matched usage modes 
within the archive system. In the general case, we ideally 
want some method of associating all nearby sources be- 
tween two lists rather than merging the lists with some 
specific procedure that uses, for example, positional coin- 
cidence within a small, fixed tolerance to make one associ- 
ation for what is assumed to be the same object in each. 
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-see image ERM 
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* frame set UID 
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^ 
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\/ 



lasSourceRemeasurement 



^ 



lasDetection 
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• filter ID 

* single passband detection 
attributes 



lasSource 

* source UID 

* RA, Dec 

* spatial index (HTM) attributes 

* proper motion attributes 

* merged classification attributes 

* overlap duplicate flag(s) 

* Point and extended source colours 

* Pfiotometric and morpfiological 
attributes x no. of passbands 
copied in from lasDetection 



Figure 6. Relational model for UKIDSS LAS catalogue data in the WSA, following on from the general case in Figure[3]and discussed 
in Section [2X2I 



Positional errors are non-linearly dependent on brightness; 
stellar positions change with time due to proper motion; 
some usage modes may require nearby sources, as opposed 
to the nearest or coincident source in two datasets. For 
these reasons, the WSA follows the SDSS system of defin- 
ing neighbour tables when joining any two datasets where 
the scientifically useful neighbourhood around any given 
object is defined by a maximum angular radius. The gen- 
eralised relational model of neighbour entities is shown in 
Figure [S] Every WFCAM Source may have one or more 
cross-neighbours recorded in XNeighbours (one entity for 
each cross-correlated ExternalSource is required) . An anal- 
ogous relationship exists between the cross-neighbour entity 
and the external source entity, i.e. a many-to-one relation- 
ship, optional on the side of ExternalSource, since once 
again every externally catalogued source may be a neigh- 
bour of one or more WFCAM catalogued objects in Source. 



Figure|8]also models entity Neighbours which is related 
to Source only. This is a neighbour table; it is analogous to 
entity XNeighbours, but it records neighbours within Source 
for every object recorded in the same entity. Hence, two 
optional one-to-many relationships exist between Source 
and Neighbours since every Source may have one or more 
Neighbours while at the same time every Source may be 
a neighbour of one or more other Sources. The concept of 
neighbour tables is discussed in more detail in Section 13.4.41 
with specific examples, and usage modes are illustrated in 
Section [l] 



2.4-1 Synoptic survey data model 

In Figure[3]we illustrate a data model that includes provision 
for a merged source catalogue having a small, fixed number 
of passbands/epoch visits via entity Source. For example, 
the UKIDSS LAS Source prescription is for visits in YJHK 
with a second epoch in J. Modern imaging surveys, however, 
increasingly aspire to extensive sampling of the time domain 
(e.g. Pan-STARRS, Kaiser 2004; GAIA, Ferryman 2005; 
LSST, Claver 2004), and we note that both WFCAM and 
VISTA synoptic infrared surveys are being undertaken. Such 
surveys, which have an indefinite and large number of field 
revisits, require modifications to the data model presented in 
Figure [S] Figure |9] shows a single-passband synoptic survey 
data model, where we have imaging MultiframeDetectors 
giving rise to one or more Detections as before. Neighbour 
entity DetectionNeighbours provides links between each 
detection and all other detections of that same object in 
each case. 

The basic relational design for synoptic survey data 
illustrated here is appropriate for a single-passband tran- 
sit survey. However, it has a number of disadvantages, 
including a large level of repeated associations in the 
DetectionNeighbours entity. For N visits in a given field, 
there will be at least A^ x (A''— 1) rows in the neighbour table 
for every source since every combination of the N detections 
taken two at a time is listed. Moreover, if the survey de- 
sign is multicolour in M passbands, DetectionNeighbours 
would rapidly becomes unmanageable as every combina- 
tion oi N X M taken two at a time is recorded, yielding 
NM{NM - 1) ^ N^M^ entries for every source. This is 
addressed in the revised data model for VISTA synoptic 
surveys presented in Cross et al. (2007). 
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Figure 5. Relational model for WFCAM and external survey 
catalogue metadata entities, and joins between them (see text). 
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Figure 4. Relational model for WFCAM surveys and pro- 
grammes in the WSA (see text). 



3 IMPLEMENTATION 

The relational data models presented previously are 
amenable to implementation in any RDBMS. The WSA 
is deployed in a commercial software product, Microsoft 
'SQL Server', a system that is suitable for medium to large- 
scale applications (this choice was made not least because 
the SDSS Sky Server catalogue access systems are deployed 
on the same - see Thakar et al. 2003b). The implementa- 
tion of the ERMs yields a set of database 'objects' known 
as a schema. The database objects mainly consist of tables, 
where each entity identified previously maps to a table in 



CurrentAstrometry 

* WCS keys 

* RA at centre 

* Dec at centre 

* celestial PA 

* spatial index 



T 



MultiframeDetector 

- see image ERM 



Figure 7. Relational model for astrometric calibration data in 
the WSA. Entities and attributes are included to allow for recal- 
ibration. 
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Figure 8. Relational model for neighbouring sources within a 
WFCAM source table, and between that table and an externally 
derived source list (e.g. an optical survey). 
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Figure 9. Relational model for single— passband synoptic survey 
data in the WSA, discussed in Section 12.4.71 



the schema. These tables hold the astronomical information 
(amongst other data) and can be queried via the WSA user 
interface applications. 

The WSA provide^ a schema browser which gives ex- 
tensive information on the objects (most notably the ta- 
bles) in all available databases. The schema browser initially 
presents the user with a tree-view of databases that are 
held in the archive. Expanding any one database item yields 
a sub-tree of objects (also expandable) that includes the 
items described below. 



3.1 Tables and indexes 

These browser entries are the primary source of astronomi- 
cal information for users. Table names are self-explanatory 
and indicative of their associated data model entities pre- 
sented previously (e.g. dxsSource, gcsSource, lasSource 
hold merged multi-colour source entries as modelled in Fig- 
ure E] for the UKIDSS DXS, GCS and LAS respectively). 
Clicking on any table name yields a full description of the ta- 
ble and its columns, including attribute names, data types, 
units and default values. Further information is available 
for some attributes (those having small icons) that link to 
brief 'tool-tip' style pop-up windows and glossary entries 
that provide more detailed information (e.g. for standard 
pipeline processing catalogue attributes, a summary of rele- 
vant algorithmic details is available - see, for example, those 
for gauSig, aperFluxl and class etc. in lasDetection). Fi- 
nally, a small but nonetheless important detail is that some 
attributes in a table's list of columns have highlighted back- 
ground colours in the browser. This indicates that an index 
exists in the RDBMS for that attribute: execution of queries 
predicated on indexed quantities is very efficient. 



3.2 Views 

Views are simply definitions of tabular sets of data derived 
from the tables available in the database, and can be queried 
in the same way as those tables. A view may be a subset of 
a single table (i.e. a subsample of the rows and/or columns 
available) or a superset of several tables. Views enhance the 
schema over and above the set of tables without incurring 
any storage penalty in the RDBMS system since the un- 
derlying tables are accessed at query time for the defined 
view row/column set. As far as the user is concerned, a view 
is simply a convenient way of accessing, via a single short 
name, a set of data formed from a selection made from one 
or more other database objects (normally tables). In the 
WSA schema browser, expanding the view tree of a given 
database produces the list of available views defined within 
it; clicking on a given item produces a description and the 
formal (SQL) definition of the view. Examples of views in 
recent UKIDSS database releases are: 

• lasPointSource - a subsample of lasSource rows con- 
taining point-like sources in the UKIDSS LAS; 

• lasYJHKMergeLog - a subsample of lasMergeLog rows 
containing frame sets with complete YJHK filter coverage 
in the UKIDSS LAS; 



|http : //surveys . roe . ac ■ uk/«sa/M«w/«sa_bro«ser . html | 
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• lasYJHKSource - a subsample of lasSource rows con- 
taining objects in areas with complete YJHK filter coverage 
in the LAS. 

Other views are defined for the UKIDSS databases, for ex- 
ample views that select samples trading off completeness 
versus reliability - consult the schema browser for more de- 
tails. The view definitions also serve as examples illustrating 
the SQL syntax required to make a specified selection (but 
more of this later) . 

3.3 Functions 

Some useful astronomical functions are provided in certain 
WSA databases, and these are listed in the browser tree- 
view under 'Functions' where available. Functions gener- 
ally take as arguments an attribute name list: for example, 
functions are provided to convert RA and Dec expressed 
in decimal degrees into a more conventional sexagesimal 
string. Other functions include spherical astronomy rou- 
tines (e.g. computation of great-circle distance between two 
points on the celestial sphere) and utility functions to for- 
mat standardised lAU names for arbitrary sources based on 
equatorial spherical co-ordinates. Once again, for more de- 
tails see the schema browser. 



3.4 Data manipulation: curation procedures 

The WSA design incorporates a set of curation applica- 
tion procedures for the creation of science-ready database 
releases for users. Curation procedures include transfer 
of pipeline-processed data, ingest of those data into the 
DBMS, production of quick-look images for browsing, and 
source merging. In this Section we give details of the most 
important procedures from the point of view of the end-user. 



3.4-1 Quality control 

The design of the WSA includes provision of features to 
enable general quality control (QC) of ingested data. Such 
features as a deprecation code attribute in every table sub- 
ject to ingest modification, and expurgation of deprecated 
data in final released database products are provided. Gen- 
eral QC is necessarily a rather open-ended problem requir- 
ing much interaction with the data, at least in the initial 
stages of survey operations. Although the WSA design does 
not preclude fully automated QC procedures, presently the 
UKIDSS data (for example) have a lengthy semi-automated 
QC process applied, some details of which are given in Dye 
et al. (2006) and Warren et al. (2007a). Table [2] provides de- 
tails of the QC checks applied to UKIDSS data as they stand 
at the time of writing. Note, however, that for UKIDSS re- 
leased database products all deprecated data are removed, 
so users will see only those data records having attribute 
deprecated=0. Presently, none of the above QC procedures 
are applied to non-survey data held in the WSA. 

Furthermore, the WSA includes provision for quality bit 
fiagging of catalogue records in common with error condi- 
tion flagging in similar survey projects and source extraction 
pipelines, e.g. SDSS (Stoughton et al. 2002), S-Extractor 
(Bertin & Arnouts 1996) and SuperCOSMOS Sky Survey 
source extraction (e.g. Hambly et al. 2001b). This procedure 



consists of the assignment of single bits to represent Boolean 
true/false conditions in an integer attribute modified during 
source extraction and/or post-processing of the extracted 
catalogues. The WSA data model includes provision for 
both, and Table |3] gives details of the post-processing qual- 
ity error bit fiags currently defined. Following Hambly et 
al. (2001b) and references therein, the philosophy is to use 
more significant bits in the flag for more severe quality error 
conditions. Hence the numerical value of the quality flag can 
be used as a measure of the relative quality of that catalogue 
record: the higher the quality error value, the more likely it 
is that the record is spurious. Of course, individual quality 
bits can be tested also to see if a given condition is true for 
a catalogue record - this is achieved using the appropriate 
bit mask (expressed in hexadecimal in Table O). 

3.4-2 Source merging 

Combining single passband and/or single epoch detections 
into a merged multi-colour, multi-epoch record is one of the 
major curation activities applied after ingestion of pipeline 
processed catalogues. The merging philosophy is based on a 
number of fundamental assumptions that are made in order 
to provide a procedure that is scalable to billions of individ- 
ual object records. Primarily, source merging is based on the 
concept of frame sets (e.g. Sections 12.4.21 and I2.4.4P where 
the individual passband/epoch detections to be merged are 
assumed to come from a set of well aligned frames. This has 
the major advantage that given any one detection, the cor- 
responding detection in another fllter or at another epoch is 
easily and quickly locatable in a tiny subset of all available 
detections over all frames since the procedure is restricted 
in its search to one specific frame. One of the disadvan- 
tages is that if a survey area is tiled differently between 
the various passband and epoch visits made, then this as- 
sumption is invalid and unmerged detections will appear in 
the final source list. Another less critical assumption is that 
a small subset of individual passband/epoch detection at- 
tributes is propagated into the source table for each merged 
source. This subset includes what is considered to be the 
most useful subset of photometric, astrometric and morpho- 
logical attributes along with associated errors, and currently 
includes a selection of four fixed aperture and Petrosian 
flux measures, model profile flux estimators, individual pass- 
band/epoch morphological classiflcations and image quality 
attributes. Note, however, that all detection attributes are 
always available in the detection tables; propagating a few of 
those more commonly used simply makes end-user querying 
easier and faster. 

In addition to propagating individual detection at- 
tributes, the source merging procedure computes new at- 
tributes. For example, default point and extended source 
colours and associated errors are calculated, in pair combina- 
tions of fllters adjacent in wavelength (e.g. for the UKIDSS 
LAS YJHK data, colours Y-J, J-H and H-K are computed). 
Also, a normally distributed merged classification statis- 
tic and associated discrete classification code are calculated 
using the available individual passband/epoch values. The 
standard 80-parameter detection attributes in the catalogue 
extraction software (Irwin et al. 2007) include a normally 
distributed, zero mean, unit variance statistic derived from 
the radial profile of each detected object. This N(0,1) statis- 
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Table 2. WSA quality control deprecation codes and their meaning. 



Deprecation Description 
code 



1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 
15 
16 
18 
19 

20 
21 
22 
26 
40 
60 
61 
62 
63 
64 
65 
66 
70 



99 
100 
101 

102 

> 127 

255 



Stack frames that have no catalogue 

Dead detector frames or all channels bad 

Undefined and or nonsensical critical image metadata attributes 

Poor sky subtraction (via pipeline sky subtraction scale factor) 

Incorrect combination of exposure time/number/integrations for survey specific projects 

Incorrect frame complements within groups/nights (for incomplete observing 'blocks') 

Undefined values of critical catalogue attributes for stacks 

Seeing=0.0 for a stack 

High value of sky that compromises the depth 

Seeing outside specified maximum 

Photometric zeropoint too bright 

Average stellar ellipticity too high 

Depth (as calculated from sky noise and 5(7 detection in fixed aperture) is too shallow compared to overall 

histogram distribution (i.e. shallower than 0.5mag wrt the modal value) or sky noise is too high for sky level 

Default aperture correction outlying in distribution of same versus seeing 

Pipeline photometric zeropoint inconsistent between image, extension and/or catalogue extension keywords 

Difference in detector sky level wrt to mean of all 4 detectors is outlying in the distribution of the same 

Provenance indicates that a constituent frame of a combined frame product includes a deprecated frame 

Inconsistent provenance for a stack or interleaved frame indicating something wrong with the image product (usually 

corrupted FITS keywords confusing the pipeline) 

Detector number counts indicate some problem, e.g. many spurious detections 

5(7 depth of detector frame more than 0.4mag brighter than modal value for a given filter /project/exposure time 

Astrometry check (pixel size and/or aspect ratio) indicates something is wrong with the image 

Deprecated because frame is flagged as ignored in pipeline processing 

Science (stack) frame is not part of a survey (e.g. high latitude sky frames in the GPS) 

Eyeball check deprecation: trailed 

Eyeball check deprecation: multiple bad channels 

Eyeball check deprecation: Moon ghost 

Eyeball check deprecation: Sky subtraction problem 

Eyeball check deprecation: Disaster (catchall category for the indescribable) 

Eyeball check deprecation: Empty detector frame 

Flat fielding problem 

Eyeball check requires deprecation, but this is the best that can be done so this should not be reobserved 

(e.g. very bright star in WFCAM field of view) 

Deprecated because observation (block, object, filter) has been repeated later (shallow surveys only). The latest 

duplication in each case is kept 

Manually deprecated because of some data flow system issue (e.g. pipeline malfunction) 

Multiframe deprecated because all detectors have been previously deprecated 

MultiframeDetector deprecated because parent Multiframe is deprecated 

Detection deprecated because parent Multiframe Detector deprecated 

Deprecated because pipeline reprocessing supersedes it (where value = 128+ deprecation code as defined above) 

Deprecated database-driven product (e.g. deep stack) 



tic describes how point-like each object is with respect to 
an empirically-derived, idealised radial profile set represent- 
ing the PSF for the frame. A value of 0.0 indicates ideally 
point-like, increasingly negative values indicate sharper im- 
ages (e.g. noise-like), and increasingly positive values indi- 
cate extended (e.g. resolved galaxies). Because the statistic 
is normalised over the full magnitude range of the data to 
the N(0,1) form, a selection between ±2.0, regardless of mag- 
nitude, will yield a sample notionally complete to 95% for 
example. For merged sources, a merged classification statis- 
tic is computed amongst those available from the individual 
passband detections. This is computed as the sum of those 
available, n, divided by y'n, noting that the result of av- 
eraging n individual zero mean, unit variance - i.e. N(0,1) 
- statistics results in a distribution of RMS 1/^n; hence 
rescaling the average by ^n - or, equivalently, dividing the 
sum by ^Jn - results in a combined statistic that is also 



N(0,1). Where a given passband and/or detection is un- 
available, or where calculation of merged attributes is not 
possible, default values (Section 12.3. l|l are used to populate 
the fields of records affected. A complete description of the 
attributes in each merged source list is available online at 
the WSA via the schema browser (see Section O. 

At the core of the WSA source merging procedure there 
is an efficient pairing algorithm which associates detections 
between a given pair of passbands/epochs based on prox- 
imity within a matching tolerance, or pairing criterion. Ta- 
ble |4] gives the radial pairing criteria currently employed 
in UKIDSS source merging (these values are stored in the 
database in table Prograimne, attribute pairingCriterion 
for every survey and non-survey programme that requires 
source merging). Note that these tolerances are large com- 
pared with the typical astrometric errors (~ 0.1") to allow, 
for example, for pairing of moving sources and very faint 
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Table 3. Post-processing error quality bit flags currently assigned in the WSA curation procedure for survey data. From least to 
most significant byte in the 4— byte integer attribute (ppErrBits; see later), byte (bits to 7) corresponds to information on generally 
innocuous conditions that are nonetheless potentially significant as regards the integrity of that detection; byte 1 (bits 8 to 15) corresponds 
to warnings; byte 2 (bits 16 to 23) corresponds to important warnings; and finally byte 3 (bits 24 to 31) corresponds to severe warnings. In 
this way, the higher the error quality bit flag value, the more likely it is that the detection is spurious. The decimal threshold (column 4) 
gives the minimum value of the quality flag for a detection having the given condition (since other bits in the flag may be set also) . The 
corresponding hexadecimal value, where each digit corresponds to 4 bits in the flag, can be easier to compute when writing SQL queries 
to test for a given condition (see later). 



Byte Bit Detection quality issue 



Decimal 
threshold 



Hexadecimal 
bit mask 












2 





4 





6 


2 


16 


2 


19 


2 


22 



Close to a dither edge (not yet implemented) 

Near to a bright star (not yet implemented) 

Deblended 

Bad pixel (s) in default aperture 

Close to saturated 

Possible crosstalk artefact/contamination 

Within dither offset of image boundary 



1 


0x00000001 


4 


0x00000004 


16 


0x00000010 


64 


0x00000040 


65536 


0x00010000 


524288 


0x00080000 


4194304 


0x00400000 



Table 4. WSA radial pairing tolerances used in UKIDSS source 
merging. 



Survey 


Radial pairing 




criterion (arcsec) 


LAS 


2.0 


GPS 


1.0 


GCS 


2.0 


DXS 


1.0 


UDS 


1.0 



sources with larger centroiding errors. Positional offset at- 
tributes for each filter/epoch pass are propagated into the 
merged source tables to allow filtering of the merged source 
list at query time if a tighter pairing criterion is required 
(see later). Once again, scalability becomes a major issue in 
a computationally expensive procedure like record matching. 
The WSA philosophy necessarily requires a compromise be- 
tween speed and 100% accurate source association for real 
data (with all its vagaries) in every conceivable situation. 
Figure [To] illustrates the straightforward scenario where two 
passes over the same area of sky are source merged. In order 
to correctly identify the nearest match in each case, the pair- 
ing procedure creates a set of pointers from set 1 as master 
to set 2 as slave, and in reverse from set 2 as master to set 1 
as slave. Then, a 'hand-shaking' run through the two sets of 
pointers is used to associate only those matches that agree 
on each other being the nearest match. This forward/reverse 
pairing and handshaking between any two detection sets 
from different passes helps to reduce spurious matches to 
a minimum - case (c) in Figure 1101 case (a) in Figure [11] 
- at the same time requiring only two passes through the 
datasets. 

Of course, this approach has its limitations. In Figure fTTI 
we illustrate a few relatively rare or pathological cases where 
the pairing algorithm will fail. However, we note that in 
cases where pairing fails, unpaired records will be propa- 
gated into the merged source lists as single passband de- 
tections and the end-user always has at their disposal the 
flexibility provided by the neighbour table (Section I3.4.4P 
to associate unmatched records of the same source using a 



more sophisticated algorithm that is appropriate to the par- 
ticular science application. Clearly it is better to minimise 
spurious pairings with an efficient algorithm than to attempt 
to match every last record correctly with an impractically 
time consuming process and at the same time risking incor- 
rect matches. In this respect, the core pairing algorithm in 
the WSA is conservative. 

Given a frame set of filter/epoch passes, source merging 
proceeds by taking each combination in pairs (e.g. for a sin- 
gle epoch ZYJHK set, Z would be hand-shake paired with Y, 
J, H and K; Y with J, H and K; J with H and K; and finally H 
with K) in order to enable merging of sources even when they 
are detected in as few as any two passes (note that epoch 
passes are treated in exactly the same way as different filter 
passes). Lastly, the full set of pointers is worked through, 
and merged sources created using the pointer associations. 
Each detection in each frame in the set is propagated once, 
and once only, into the merged source list, either as part of 
a merged record or on its own as a single passband detec- 
tion. Offsets in local tangent plane co-ordinates are stored in 
the merged source list; these quantify the distance between 
the pairings, the shortest wavelength considered as the ref- 
erence position in each case. In the the single epoch ZYJHK 
example above, handshake pairs between Z as reference and 
YJHK as 'slave' are propagated into the merged source list 
first, with offsets from the Z position stored in attributes 
jXi, jEta, hXi, hEta, etc. Then any remaining Y detections 
would be considered as reference for JHK slaves, etc. 

The combination of i) a relatively large radial pairing 
criterion, ii) handshake pairing, and iii) storage of offset val- 
ues between pairs provides maximum flexibility for the end 
user. The large pairing radius maximises the chances of mov- 
ing objects or objects with large centroiding errors being 
paired. At the same time, the handshaking procedure min- 
imises spurious pairings in ambiguous situations and forces 
nearest neighbour matches to be chosen always. Finally, the 
availability of the pairing offsets in the merged source list 
enables the end-user to 'tune' the pairing radius at query 
time - limiting pairing offsets can be specified to a maxi- 
mum allowed by the radial pairing tolerance, as appropriate 
to the science application (see later). 

Finally, the WSA merged source procedure has a 'seam- 
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Figure 10. Pretend catalogue data illustrating the core pairing algorithm between two filter/epoch pass sets in a small area: (a) close, 
but well separated objects paired lal=2al, Ia2=2a2 etc.; (b) isolated moving object lbl=2bl; (c) differently deblended objects, where 
lcl=2cl, 2c2 remains unpaired since although Icl is within the pairing tolerance of 2c2 when set 2 is master, when set 1 is master 2cl 
is closest to Icl and hence lcl=2c2 fails at the hand— shaking stage (see text for more details). 
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Figure 11. As Figure [TOl but illustrating some limitations of the current WSA pairing algorithm: (a) very high proper motion star (1) 
moves past another object detected in both sets - Ia2=2a2 satisfies hand-shake pairing and is paired, but lal points to 2a2 and 2al 
points to la2 so the fast moving object is not correctly matched, fails hand-shake pairing and remains unpaired; (b) very high proper 
motion object moves past an object detected only in set 2 - lbl=2b2 satisfies hand— shake pairing and is incorrectly matched, while 2bl 
remains unpaired. 



ing' feature that enables selection of a science-ready merged 
source sample. AH imaging surveys have some degree of over- 
lap between adjacent fields, perhaps by design (to enable 
cross-calibration for example) or because of instrument de- 
sign or guide star limitations. The WFCAM focal plane ar- 
ray (Casali et al. 2007), consisting of 2 x 2 detectors spaced 
by ~ 95% of the detector width, automatically produces 
overlap regions in survey areas tiled for contiguous cover- 
age. Moreover, at high Galactic latitudes in particular, guide 
star limitations can result in overlap regions of increased 



size. Because repeat measurements of the same objects pro- 
vide scientifically useful information, the WSA philosophy 
is to retain duplicates in the merged source lists, noting by 
means of an attribute flag (see Section [Jjl when a particular 
source has duplicates present, and if so, which measurement 
is considered to be the 'best'. A source is considered to be 
duplicated when an adjacent frame set contains a source 
within 0.8" using the same pairing/handshaking procedure 
described earlier. Briefly, the decision logic behind the choice 
of the best source examines each set of duplicates (there may 
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be two or more to choose between) on a source by source 
basis. Source records having the most complete passband 
coverage are favoured primarily; when two or more source 
records all have the same number of passband measures, the 
choice of primary source is based on position relative to the 
edges of the corresponding image (detections furthest from 
the edges are favoured) amongst the set of duplicates having 
the fewest quality error bit flags set (Section 13. 4. ip . 

3.4-3 Enhanced image products 

Within UKIDSS, the DXS and UDS include image data in 
the same pointing and same filter that are taken over many 
observing blocks on the same or different nights. Thus it is 
necessary to stack these data at the archive to produce final 
image products of the required depth. Cataloguing of these 
deep image stacks is also performed at the archive end. In 
the case of the UDS, the cataloguing is performed on mosaics 
made up of the 4 pointings so that objects at the boundaries 
of each pointing are measured at the full depth of the survey 
and are not broken up into pieces. 

The DXS uses the same stacking and cataloguing code 
used in nightly pipeline processing of the shallow surveys 
(Irwin et al. 2007) but the UDS images have been stacked 
and mosaiced by the UDS team (e.g. Foucaud et al. 2007) 
using the Terapix software SWARP (Bertin et al. 2002) , and 
we have used Source Extractor (Bertin & Arnouts 1996) to 
catalogue the UDS deep mosaics. Only those intermediate 
stack images (i.e. the stack products of individual observing 
blocks) that pass standard survey quality control (e.g. Sec- 
tion [SAl} are included in deep stacks/mosaics in the WSA. 

3.4--4 Neighbour/cross-neighbour catalogue joins 



As described in Section [2]4]6] the concept of a neighbour ta- 
ble provides a generalised cross-matching facility that can 
service diverse usage modes. The WSA philosophy is to pro- 
vide neighbour tables for each merged source table, in or- 
der to allow, for example, easy and quick internal consis- 
tency checks on calibration. Furthermore, cross-neighbour 
tables are provided between UKIDSS source tables and a 
selection of other large external survey datasets, again to 
facilitate rapid cross-matched astronomical usage modes. 
We note that the generic problem of cross-matching very 
large datasets (i.e. those containing ^ billions of rows) is 
receiving attention in the burgeoning Virtual Observatory 
(e.g. O'MuUane et al. 2005 and references therein); the 
WSA currently holds local copies of user-required external 
datasets (e.g. SDSS catalogue data releases, the 2MASS cat- 
alogues) in lieu of fast VO-implemented solutions. As far as 
a scalable implementation is concerned, the WSA employs 
bulk data egress/ingest facilities provided in the back-end 
RDBMS, and an application making use of the 'plane sweep' 
algorithm (Devereux et al. 2005) for extremely fast cross- 
matching. 

Further details concerning neighbour tables, the exter- 
nal datasets held in the WSA and corresponding cross- 
neighbour tables are given online in the schema browser 
(Section [3|. For example, a cross-matching neighbourhood 
radius of 10" is used generally although this varies depend- 
ing on the tables being matched. Illustrative usage examples 
are given below. 



4 ILLUSTRATIVE SCIENCE EXAMPLES 

Appendix|X]lists some typical archive usage modes that were 
identified in collaboration with the user community (i.e. the 
UKIDSS consortium) early on in the WSA design phase. 
For casual browsing and usage involving limited data sub- 
sets or very small areas of sky, the static web forms pro- 
vided in the WSA user interfac4j are sufficient to give the 
user the required data retrieval functionality. However, for 
large-scale (e.g. large area) and/or complex (e.g. wholesale 
statistical analysis) usage modes such as those illustrated in 
Appendix [X] the provision to the user of a highly flexible 
interface is necessary. The WSA design philosophy is to ex- 
pose the Structured Query Language (SQL) interface of the 
underlying RDBMS to the user to provide the required flexi- 
bility. Allowing users to execute data selections, calculations 
and statistical computations on a machine co-located with 
the data (i.e. 'server-side', or on the computer that hosts 
the RDBMS itself) allows many users to access the large 
data volume without recourse to wholesale distribution of 
the entire data set. 

A free-form SQL interface is provideqj in the WSA in- 
terface, and the example scripts below can be input directly 
once a user is logged in and/or and appropriate database re- 
lease has been selected. Options within the interface include 
upload of a script flle in addition to direct typing or cut-and- 
paste. Note that the WSA free-form SQL interface imposes 
the following limits on individual queries: maximum execu- 
tion time 4800 seconds; output rowsx columns = 15 x 10^ 
(i.e. more attribute columns selected implies fewer rows al- 
lowed in the results file). These limits are imposed to pre- 
vent inexperienced users locking up the service with erro- 
neous and/or inefficient queries. No limit is currently made 
on the number of concurrent queries or the frequency with 
which they can be submitted. Output formats include plain 
comma-separated text, FITS binary table and VOTablqj, 
an XML format designed for international Virtual Observa- 
tory (VO) initiatives. 

At the time of writing, other interface options are un- 
der development; furthermore, the WSA is in the process 
of being integrated into the VO via deployment of infras- 
tructure developed by the AstroGrid project (e.g. Walton et 
al. 2006). In particular, UKIDSS database releases are pub- 
lished to the VO using the AstroGrid Data Set Access (DSA) 
software. This has several advantages, (i) The database is 
visible in VO resource registries around the world, and so 
turns up in searches for databases of this kind, (ii) The meta- 
data describing the database (column names, unified con- 
tent descriptors, table structure) are available through any 
VO-compatible software, (iii) Our database accepts queries 
in the IVOA standard query format. Astronomical Data 
Query Language (ADQL). This means that generic query 
software, such as the AstroGrid Query Builder, can be used 
to issue queries to UKIDSS data, (iv) Our database under- 
stands calls coming from libraries of routines in the "Astro 
Runtime", so that for example, programmable use of the 
database can be made using high-level languages such as 
Python. 



^ 'http : //surveys . roe . ac . uk/wsa/dbaccess . html ' 

* http : //surveys . roe .ac.uk: 8080/wsa/SQL_f orm . j sp] 

^ |http ://«««. ivoa.net/Documents/latest/VdT .html I 
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4.1 Guidance for the use of SQL in the WSA 

In Appendix|B]we give a brief introduction to the fundamen- 
tals of SQL data retrieval (SELECT) statements. A more com- 
prehensive guide is given onlinqlj in the WSA 'SQL cook- 
book', but in this Section we give brief guidance to avoid 
common mistakes and to get the most from the system. 



4.1.1 Use COUNT (*) and TOP N 

A good way of checking that a query is sensible is to re- 
place the attribute selection list with COUNT(+) since this 
skips creation of an output file (including any DBMS look- 
up stage which can be time consuming for large row counts) 
and can indicate if something is badly wrong in a query 
(e.g. an incorrectly specified table join). Consider query B8 
in Appendix [B] where a list of UKIDSS programmes/filters 
is required: 

SELECT COUNT (*) 

FROM Programme AS tl, RequiredFilters AS t2 

/* NB: this is not a good query */ 

returns a count of 642 which is clearly wrong since there are 
five UKIDSS programmes with on average ~ 4 filter cover- 
age per programme - we would expect a count of ~ 20. As 
noted in Appendix |Bj the related rows in the tables need to 
be explicitly filtered using the referencing attribute common 
to both - in this case, the unique identifier prograiranelD: 

SELECT COUNT (*) 

FROM Prograirane AS tl, RequiredFilters AS t2 

WHERE tl .programmelD = t2 .programmelD 

returns a much more reasonable figure of 22 for UKIDSS 
DR2. Note that summary counts for various survey release 
tables are available online on the WSA web pages. Further- 
more, data analysis plots showing the density of stars and 
galaxies in colour space are also available - these can be 
helpful when searching for rare objects in sparsely popu- 
lated colour ranges. 

Note that another useful SQL command is TOP when de- 
bugging queries. For example, SELECT TOP 10 . . . FROM . . . 
will simply give the first ten rows that satisfy the query and 
then execution will stop. The reduced results set can be in- 
spected for appropriateness and/or errors before running the 
same query again without TOP 10. 



4.1.2 Use GROUP BY for counts in arbitrary bins 

Following on from the use of COUNT (*), the addition of 
GROUP BY (and furthermore statistical aggregates like AVG() 
for means, MIN() and MAXO for minimum and maximum 
etc. - see Appendix [B| is very useful for summarising the 
contents of a selection and/or binning up data with a single 
pass through the table. For example, what are the source 
counts in Galactic longitude slices in the UKIDSS GPS? Do 
not use 

SELECT COUNT (*) 

FROM gpsSource 

WHERE 1 BETWEEN 0.0 AND 1.0 



' |http : //surveys . roe . ac . uk/wsa/sqlcookbook ■ html | 



and then another query 

WHERE 1 BETWEEN 1.0 AND 2.0 

and so on. It is much easier and much more efficient to use 
GROUP BY to bin up in slices defined by longitude rounded 
to the nearest degree, for example; 

SELECT CAST (ROUND (1,0) AS INT) AS longitude, 

COUNT (*) 
FROM gpsSource 
GROUP BY CAST(R0UND(1,0) AS INT) 
ORDER BY CAST (ROUND (1,0) AS INT) 



The query in Section [4.2.41 below illustrates this further for 
the real survey data; for details of SQL functions like CAST 
and ROUND consult the WSA online documentation or any 
standard text on SQL. 



4-. 1.3 Take great care when joining tables 

Following on from checking using COUNT (*) as illustrated 
above, in general follow these simple rules when employing 
implicit table joins (i.e. when supplying comma-separated 
lists of tables in a FROM clause) : 

• for a list of A^ tables, ensure there are at least A'' — 1 
WHERE conditions associating related rows in those tables; 

• never attempt spatial joins on co-ordinates (e.g. the 
query SELECT ... FROM lasSource AS s, lasDETECTION 
AS d with an attempted joining clause of WHERE s . ra=d . ra 
AND s.dec=d.dec is inadvisable from many standpoints in 
addition to being dreadfully inefficient); 

• always use the relational unique identifiers (i.e. primary 
keys) that associate related rows in related tables. 

For example, suppose a GPS user requires a source se- 
lection including an attribute that is not available in the 
source table, e.g. the modified Julian date of the J obser- 
vation and the isophotal magnitude in H. The relational 
model detailed previously shows that the related tables are 
gpsSource, gpsMergeLog, gpsDetection and Multiframe, 
since every merged source belongs to a frame set recorded 
in gpsMergeLog and consists of detections recorded in 
gpsDetection arising from frames recorded in Multiframe. 
Examination of the arrangement of the UKIDSS data via the 
schema browser (Section[3l) identifies tables Multiframe and 
gpsDetection as containing the relevant attributes mjdObs 
and isoMag respectively. Clearly, these four tables must ap- 
pear in the FROM clause of the query and it is vital to include 
filters in the WHERE clause to associate the related rows: 

SELECT TOP 10 s.sourcelD, s.ra, s.dec, 

m.mjdObs AS jmjd, d.isoMag AS hIsoMag 
FROM gpsSource AS s, gpsMergeLog AS 1, 

gpsDetection AS d, Multiframe AS m 
WHERE 
/* Associate each source with its frame set: */ 

s.frameSetID = l.frameSetID AND 
/* Pick out the H band detection: */ 

l.hmfID = d.multiframelD AND 

l.heNum = d.extNum AND 

s.hseqNum = d.seqNum AND 
/* Pick out the J band frame : */ 

l.jmfID = m.multiframelD AND 
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I* Keep only sources having J and H: */ 
l.jmfID > AND l.hmfID > 

Note that the cross-referencing attributes are all defined as 
primary keys in the referenced table entries in the schema 
browser; the RDBMS is extremely efficient in locating rows 
in tables using these. 



able onlinqij in the WSA documentation; further examples 
of WSA SQL queries can be found in Dye et al. (2006) and 
Lodieu et al. (2007a). The following queries are presented 
in order of increasing complexity rather than in the order of 
the usages in Appendix [X] Row counts and execution times 
at the end of the scripts are those for UKIDSS Data Release 
2 when selecting FITS output format (for those queries that 
return many row results sets). 



4-1-4 Use views, especially when new to the data 

There are a number of predefined selections based on var- 
ious optimisations of completeness versus reliability from 
cuts on various morphological parameters available in sur- 
vey database releases in the WSA. Users are advised to check 
the available views (again in the schema browser) when new 
to the WSA survey datasets to see if there are any that suit 
a given astronomy application. For example, there is a view 
of lasSource called reliableLasPointSource, which is a 
predefined selection with cuts on morphological parameters 
and a requirement for detection in Y, J and H for a reliable 
sample of point sources. 



4-1.5 Tune paired/ cross-matched selections appropriately 

When using the merged source tables and/or the neighbour 
tables for cross-matches between tables, users are advised 
to think carefully about the maximum angular distance that 
is applicable to a given astronomy application. The default 
pairing/cross-matching radii are conservative in that they 
are set deliberately large to cover as many applications as 
possible, but they may be too large for a specific case and 
should be limited at query time. For example, attributes 
Xi and Eta are available for each passband in the merged 
source table - if an astronomy application of the GCS does 
not anticipate any pairings outside a 0.5" radius, then the 
following predicates should be included: 

WHERE zXi BETWEEN -0.5 AND +0.5 AND 
zEta BETWEEN -0.5 AND +0.5 AND 
yXi BETWEEN -0.5 ... 

etc., for all passbands as necessary. For the case of cross- 
matched selections employing neighbour tables, an appro- 
priate limit on the neighbourhood radius should be placed 
via a predicate on the attribute distanceMins which is the 
distance in arcminutes between any given 'master' source 
and a 'slave' cross-match in the neighbourhood of the for- 
mer. Further examples of this are given below. 



4-2.1 Candidate Galactic cluster members 

Usage example U3 in Appendix |X] requires candidate cluster 
member selection from the UKIDSS GCS by colour, magni- 
tude and proper motion. Colour selection is straightforward 
in SQL: 

SELECT zAperMag3-jAperMag3 AS zm j , 

zAperHagS AS z 

FROM gcsPointSource 
WHERE 

/* Positional cuts for the Sigma Orionis in the 
Orion Nebula Cluster (in degrees for both) : */ 

ra BETWEEN +84.00 AND +85.00 AND 

dec BETWEEN -2.85 AND -2.30 AND 
/* Magnitude cuts to avoid saturated sources : */ 

zAperMagS > 11.3 AND 

yAperMag3 > 11.5 AND 

jAperMag3 > 11.0 AND 

hAperMag3 > 11.3 AND 

k_lAperMag3 > 9.9 AND 
/* Magnitude/colour cuts to select out the member 
sequence: */ 

zAperMag3 < 5 . 0*(zAperMag3-jAperMag3) + 

10.0 AND 

jAperMag3-hAperMag3 > 0.3 
/* UKIDSS DR2 rows returned: 144 

Execution time: 00m 12s */ 

where the colour/magrutude selection cuts have been de- 
fined by examining colour-magnitude and colour-colour 
plots of the selection made without the final two predicates. 
Figure [TJ] illustrates the results in Z versus Z— J colour- 
magnitude diagrams that clearly show the cluster member 
sequence. At the time of writing, UKIDSS GCS proper mo- 
tions are unavailable because second epoch survey observa- 
tions have yet to start. However, Lodieu et al. (2007a) show 
that, at least for brighter stars, proper motions can be com- 
puted by comparison with 2MASS catalogue positions; see 
also Lodieu et al. (2007c) where this kind of usage is de- 
omstrated for the Pleiades open star cluster in the UKIDSS 
GCS. 



4.2 Example SQL queries for astronomy usages 

In this Section we give a set of astronomy SQL query ex- 
amples that are used as steps in part fulfilment of the us- 
ages in Appendix |^ where in each case, an explanation is 
given and results are illustrated. As noted in Appendix iBl 
the WSA interface is case-insensitive: mixed case is used in 
the examples for clarity in distinguishing SQL keywords and 
database object names. Note also that /*...*/ can be used 
to enclose comments in the scripts; these are ignored by the 
WSA DBMS when the script is run. The scripts are avail- 



4-2.2 Counts of objects that are unpaired between epochs 

Usage example U4 in Appendix |X] includes requirements 
to select a sample of high proper motion stars having to- 
tal proper motion fj, > Sct^, and to count the number of 
sources that are unpaired between the two epochs of the 
UKIDSS LAS J-band imaging. There are a number of ways 
of achieving this, with increasingly sophisticated searches 



^^ |http : //surveys . roe . ac ■ uk/wsa/pubs ■ html | 
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Figure 12. (a) Colour-magnitude plot of the results set from the example query in Section 14.2.11 without the final two predicates 
showing the general field population and a clear brown dwarf cluster sequence in the cr-Orionis cluster; (b) the same plot but using the 
two additional predicates to select out the cluster sequence. 



yielding increasingly reliable candidates (but often at the 
expense of completeness). As a first step, use of the view 
reliableLasPointSource is recommended. For the paired 
high proper motion selection, we note that since 



2 2 2 



(1) 



where ^a and ^s are the components of proper motion (mea- 
sured in the same units) in RA and Dec respectively, and 
combining proper motion component errors in quadrature, 
we have that 



/',,2„2 I ,,2„2 \l/2 



(2) 



Hence, the 5(t condition on total proper motion, /i > 5crp 
becomes 

{A + mI) > 5(m^'^mc. + i^l^lsY^'^- 



',s> ■ (3) 

In SQL, the high proper motion selection statement is 



SELECT COUNT (*) 

FROM reliableLasPointSource 

WHERE SQUARE (muRA) + SQUARE(muDec) > 5.0*SQRT( 

SQUARE (muRA*sigMuRA)+SQUARE(muDec*sigMuDec) 
) 
/* UKIDSS DR2 rows returned: 1 (count=0) 
Execution time: 01m 00s */ 

At the time of writing no second epoch observations have 
been taken for the UKIDSS LAS, so this query returns zero 
in releases up to and including DR2. 

For the count of unpaired objects, use of the view 
lasReliablePointSource is recommended. Examination of 
the available table attributes in the view (see Section [S] the 
attribute list is the same as the base table lasSource from 
which the view is derived) shows first- and second-epoch 
attribute names are prefixed by j_l and j_2 respectively. 



Default values in one or other of the detection unique iden- 
tifiers Ob j ID for a given passband indicate no merged pair in 
that band, so a count of unpaired sources is simply obtained 
via 

SELECT COUNT (*) 

FROM reliableLasPointSource 

WHERE 

/* Specify detection at one epoch only: */ 

(j_10bjID > AND j_20bjID < 0) OR 

(j_10bjID < AND j_20bjID > 0) 
/* UKIDSS DR2 rows returned: 1 (count=827968) 
Execution time: 01m 06s */ 

where the test condition is for a default detection identifier 
value (i.e. no detection) at one or other, but not both, of 
the two epochs. Once again, because no second epoch ob- 
servations are available presently, this query simply returns 
a count of all objects in the view since the definition of 
reliableLasPointSource excludes any object not detected 
at j_l. 



4-2.3 Deep galaxy catalogues 

Usage example U5 in Appendix |X] concerns user-selected 
galaxy catalogues. The following simple SQL example shows 
how to do this for the UKIDSS DXS: 

SELECT ra, dec, 

/* De-reddened Petrosian magnitude and 
fixed aperture colour: */ 
jPetroMag-aj as j, 

(jAperMag3-aj)-(kAperMag3-ak) as jmk 
FROM reliableDxsSource 
WHERE 
/* Classification cut to exclude all point 
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Figure 13. Colour-magnitude diagram in J versus J— K showing 
the results of the query in Section 14.2.31 



sources: */ 

mergedClass NOT BETWEEN -1 AND AND 
/* Exclude any sources with poorly or undefined 
Petrosian mags: */ 

jPetroMagErr BETWEEN AND 0.2 AND 
kPetroMagErr BETWEEN AND 0.2 
/* UKIDSS DR2 rows returned: 142,996 

Execution time: 00m 06s */ 

Here, we use the available view reliableDxsSource to de- 
fine a clean (but necessarily incomplete) selection, excluding 
point-like sources. Several choices are available as regards 
extended source flux measures - see the entry for the base 
table dxsSource in the schema browser (described in Sec- 
tion [3|. In this case, we have chosen the Petrosian appar- 
ent magnitude, dereddened for foreground Galactic extinc- 
tion (Schlegel, Finkbeiner & Davis 1998; Bonifacio, Monai 
& Beers 2000) and fixed 2" diameter apertures for a colour 
index. A colour-magnitude diagram is shown in Figure [T51 
The spatial extent of the deep stacked UKIDSS surveys is 
easily determined in SQL by a number of methods. The 
simplest is illustrated for the UDS as follows: 

SELECT MIN(ra),MAX(ra),MIN(dec),MAX(dec), ( 

(MAX (ra) -MIN (ra) ) *COS (RADIANS ( AVG (dec) ) ) ) * 
(MAX(dec)-MIN(dec) 
) AS area 
FROM udsSource 
/* UKIDSS DR2 rows returned: 1 

Execution time: 00m 03s */ 

This query returns the extent of the UDS in RA and Dec and 
the area covered: 0.89 square degrees (more sophisticated 
examples concerning areal coverage information are given 
in Section I4.2.6|l . As a further example of galaxy catalogue 
selection, consider the following query: 




Kab 
Figure 14. Galaxy number— magnitude counts in the UDS from 
the final query in Section 14.2.31 (cf. Figure 2 of Lane et al. 2007). 



SELECT CAST (ROUND (kab*2. 0,0) AS INT)/2.0 AS K_AB, 

LOG10(COUNT(*)/0.89) AS logN 
FROM ( 

SELECT (kPetroMag-ak) +1.900 AS kab 
FROM udsSource 

WHERE mergedClass NOT BETWEEN -1 AND AND 
jPetroMag > 0.0 AND 
kPetroMag > 0.0 
) AS T 
GROUP BY CAST (ROUND (kab*2. 0,0) AS INT)/2.0 
ORDER BY CAST (ROUND (kab*2. 0,0) AS INT)/2.0 
/* UKIDSS DR2 rows returned: 42 

Execution time: 00m 03s */ 

This consists of a nested subquery to select UDS galaxy cat- 
alogue Kab magnitudes via some simple predicates (note 
that Vega-to-AB magnitude conversion constants are pro- 
vided for each WFCAM passband in table Filter). The 
outer query uses a combination of SQL functions to bin up 
counts of the number of galaxies in 0.5 magnitude bins via 
grouping (see Appendix |Bj within the appropriate ranges. 
The results are plotted in Figure [T^ and are in good agree- 
ments with similar counts in Figure 2 of Lane et al. (2007) 
at the faint end where galaxies dominate over stars in the 
counts. 



4.24 Star counts m cells m the UKIDSS GPS 

One of the (many) advantages to the availability of a flexible 
SQL interface in the WSA is that it allows the user to make 
summaries of the data held without recourse to downloading 
entire source catalogues. For example, in the UKIDSS DR2 
the GPS merged source table contains 3.6 x 10* rows; with 
a row length of ~ 1 kilobyte the DR2 GPS merged source 
catalogue is over one third of a terabyte in size. Usage exam- 
ple U8 in Appendix |X] shows a typical example where star 
counts in cells (in this case in spherical polar co-ordinate 
space) are required as a broad-brush summary of the cata- 
logue. SQL provides several functions that make counts in 
bins in arbitrary parameter space relatively straightforward: 

SELECT CAST (ROUND (1*6. 0,0) AS INT)/6.0 AS Ion, 
CAST (ROUND (b*6. 0,0) AS INT)/6.0 AS lat , 
COUNT (*) AS num 
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FROM gpsSource 

WHERE k_lClass BETWEEN -2 AND -1 AND 

k_lppErrBits < 256 AND 

/* Make a seamless selection (i.e. exclude 
duplicates) in any overlap regions: */ 

(priOrSec=0 OR priOrSec=f rameSetID) 
/* Bin up in 10 arcmin x 10 arcmin cells: */ 
GROUP BY CAST(R0UND(1*6.0,0) AS INT)/6.0, 
CAST (ROUND (b*6. 0,0) AS INT)/6.0 
/* UKIDSS DR2 rows returned: 28,026 

Execution time: 72m 00s */ 

In this example, nearest integer values of / x 6 and 6x6, 
where Z, h (in units of degrees) are Galactic longitude and 
latitude respectively, yield cells of size 10 x 10 arcmin'^. We 
have chosen to use K-band star counts in this case, since 
this passband has the most GPS data at DR2. Note the use 
of the predicate (priOrSec=0 OR priOrSec=frameSetID). 
This uses the 'primary or secondary' flag attribute to select 
only those sources that have no duplicates (priOrSec=0) 
or primaries in the presence of duplicates (priOrSec points 
to the current frame set identifier, indicating the source 
is duplicated but that the current record is the best one 
to use); conversely, all the secondaries of duplicates (and 
only those secondaries) could be selected by specifying 
priOrSec>0 AND priOrSecOframeSetlD. The results of the 
seamless selection in the query above are shown in Figure fTSl 
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Figure 16. Two-colour diagram (cf. Figure 5 of Hewitt et 
al. 2006) illustrating the principal colour space of optical/infrared 
QSO candidate selection (see Section 14.2.51 . 



4-2.5 Optical/infrared selection of QSO candidates 

Usage U2 in Appendix|X]requires two selections; (i) a set of 
point sources satisfying certain optical/infrared colour cuts; 
and (ii) a 1 in lO** sampling of all point sources without those 
colour cuts. Both are easily achieved in SQL - the availabil- 
ity of the view lasPointSource is particularly convenient. 
The following query provides selection (i): 

SELECT psfMag_i-psfMag_z AS imz, 

psfMag_z-j_lAperMag3 AS zmj , 
psfMag_i-yAperMag3 AS imy, 
ymj_lPnt AS ymj 

FROM lasPointSource AS s, 

lasSourceXDR5Photo0bj AS x, 
BestDR5. .PhotoObj AS p 

WHERE 
/* Join predicates: */ 

s.sourcelD = x.masterObjID AND 
x.slaveObjID = p.objID AND 

x.distanceMins < 1.0/60.0 AND 

/* Select only the nearest primary SDSS 
point source crossmatch: */ 
X.distanceMins IN ( 

SELECT MIN(distanceMins) 

FROM lasSourceXDRSPhotoObj 

WHERE masterObjID = x.masterObjID AND 



sdssPrimary = 1 
sdssType = 6 

) AND 
/* Remove any default SDSS mags: */ 

psfMag_i > 0.0 AND 
/* Colour cuts for high-z QSOs from 
Hewett et al . (2006) and Venemans 
et al. (2007) : */ 



AND 



psfMag_i-yAperMag3 > 4.0 AND 
ymj_lPnt < 0.8 AND 

psfMagErr_u > 0.3 AND 
psfMagErr_g > 0.3 AND 
psfMagErr_r > 0.3 
/* UKIDSS DR2 rows returned: 12 

Execution time: 19m 56s 



*/ 



where the colour cuts have been determined with refer- 
ence to Hewitt et al. (2006) and Venemans et al. (2007). In 
fact, usage example U2 is somewhat unrealistic in that the 
'legacy' SDSS lacks the depth to detect QSOs having z ~ 7 
as illustrated in Venemans et al. (2007); optical drop-out 
techniques (see later) or deeper optical data are needed for 
the most highly redshifted QSOs. Furthermore, some con- 
tamination from differently deblended sources and poorly 
photometered sources near very bright stars is present in 
exactly the position where the high redshift QSO locus is ex- 
pected to lie. However the SQL provided here serves at least 
to illustrate how to ask this kind of question in the WSA; 
moreover, it produces a list of a dozen candidates which is 
a viable number for closer scrutiny (e.g. inspection of image 
thumbnails and subsequent spectroscopic follow-up). 

For selection (ii), removing the colour cut predicates 
and adding the predicate . . . AND (sourceIDy,10000)=0 will 
select one in every 10* sources randomly scattered over the 
survey area (the '"/," modulo operator returns the remain- 
der of the number on the left after dividing by that on the 
right). This is because sourcelD is assigned sequentially in 
the source merging procedure and for large increments this 
attribute is not strongly correlated with position. The re- 
sults are illustrated in the two-colour diagram in Figure [TS] 
(1 in 10 sources plotted from UKIDSS DR2 cross-matched 
with SDSS DR5 rather than a l-in-lO'* sampling). 
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Figure 15. K-band star counts in tlie UKIDSS GPS produced by the query given in Section 14.2.41 The scale bar is in units of stars per 
square arcminute. 
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4-2.6 A wide-area, shallow galaxy catalogue 

Usage example U6 in Appendix |A] specifies the selection of 
a galaxy catalogue with full optical/infrared photometry to 
K=18.4 from the intersection of the UKIDSS LAS and SDSS 
optical survey. In the following example, we further extend 
this usage mode to extract redshift information from SDSS 
spectroscopy, and compute absolute magnitudes Mk via an 
Einstein-de Sitter cosmological distance modulus with Hub- 
ble constant Ho = 75 km s~^Mpc~^, all in SQL. The nearest 
crossmatch between the LAS and SDSS with a matching tol- 
erance of 2" is selected: 

SELECT s.ra as alpha, s.dec as delta, 

/* Remove Galactic foreground reddening: */ 

(petroMag_u-extinction_u) 

(petroMag_g-extinction_g) 

(petroMag_r-extinction_r) 

(petroMag_i-extinction_i) 

(petroMag_z-extinction_z) 

(yPetroMag-ay) 

(j_lPetroMag-aj) 

(hPetroMag-ah) 

(kPetroMag-ak) 

z.z 

(modelMag_g-extinction_g) - 
(modelMag_r-ext inct ion_r ) 

(yAperMag3-ay)-(kAperMag3-ak) AS ymk, 

(modelMag_u-extinction_u) - 
(modelMag_g-extinction_g) AS umg, 
/* Einstein-de Sitter cosmology distance modulus 
(note no K-correction, no evolution correction, 
and no internal extinction) : */ 

(kPetroMag-ak) - 25 - 5* ( 
L0G10(2*2 . 998e5* (1+z .z-SQRT(l+z . z) ) /75) 

) AS M_K 
FROM lasExtendedSource AS s , 

lasSourceXDRSPhotoObj AS x, 

BestDRB. .PhotoObj AS p, 

BestDRB . . SpecOb j AS z 
WHERE 
/* Join criteria: */ 

z.specObjID=p.specObjID AND 

s.sourcelD = x.masterObjID AND 

p.objID = x.slaveDbjID AND 

x.distanceMins IN ( 

SELECT MIN(distanceMins) 

FROM lasSourceXDR5Photo0bj 

WHERE masterObjID = x.masterObjID AND 
distanceMins < 2.0/60.0 
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Figure 17. Absolute-magnitude versus colour plot for a wide- 
area, shallow galaxy catalogue extracted using the query in Sec- 
tion 11:2:61 which trawls the UKIDSS LAS-DR2 and SDSS-DR5 
crossmatch (see text). 



) AND 
/* Dereddened magnitude cut as specified: */ 

(kPetroMag-ak) BETWEEN 0.0 AND 18.4 AND 

yPetroMag > AND 

modelMag_u > AND 

modelMag_g > AND 

modelMag_r > AND 
/* Exclude any non spectroscopic redshift 
objects for a clean sample: */ 

z.z BETWEEN 0.01 AND 0.15 AND 

z . zWarning=0 
/* UKIDSS DR2 rows returned: 8,086 

Execution time: 05m 13s */ 

In Figure [T7] we plot Mk (as a proxy for total stellar mass) 
versus (u— g) which shows a bright red clump of ellipticals 
along with a sequence of fainter, bluer star-forming and/or 
spiral galaxies and finally yet fainter, bluer dwarfs. 

Spatial sampling of selections from base table 
lasSource (or indeed any merged source table in the WSA) 
can be determined in several ways. The simplest method 
(e.g. for making an areal coverage plot) is to use the central 
positions of the frame sets available in lasMergeLog: 

SELECT ra,dec 
FROM lasMergeLog 
WHERE j_lmfID > AND 

hmfID > AND 

kmfID > 
/* UKIDSS DR2 rows returned: 6,242 
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Execution time: 



00m 04s */ 



where the predicates require coverage in passbands JHK, but 
not necessarily Y, as is the case in view lasExtendedSource 
for example. The size of each frame set in the LAS is the 
size of one WFCAM detector, or 13.65 arcmin. Figure 1181 
shows the area covered by plotting squares of this size at 
each co-ordinate pair returned by the query. 

More sophisticated functionality is provided via use of 
Hierarchical Triangular Mesh (HTM; Kunszt et al. 2000) in- 
dices which are available as attribute htmID where an equa- 
torial RA,Dec pair are present in most WSA tables. For 
example, the set of HTM triangles covering a given selec- 
tion to a given HTM level (see Kunszt et al. 2000) can be 
obtained using the SQL DISTINCT function along with divi- 
sion by an integer power TV of 4 to mask to the (20 — 7V)^^ 
level where the WSA uses 20-level indexing by default: 

SELECT DISTINCT(htmID/P0WER(4, 20-12)) 
FROM lasSource 
WHERE . . . 

will return the identifiers of the HTM triangles at level 12 
f areaa^^l between 0.86 and 1.8 square arcmin) covered by the 
LAS merged source table for the given predicates. Libraries 
of various routines for the manipulation and translation of 
spatial co-ordinates an associated HTM indices are are avail- 
able online^. Note that the areal coverage maps returned by 
any of these queries are not the maps of survey depth that 
would be needed to compute survey volume corrections. 



4-2.7 Infrared colour-selected sources that are optical 
drop-outs 

Usage Ul in Appendix |X] requires non-detection in opti- 
cal (iz) passbands for an infrared colour-selected sample 
of point sources as cool, substellar candidates (see, for ex- 
ample, Kendall et al. 2007). One could envisage this being 
achieved within the archive by automatically placing aper- 
tures in optical images (i.e. SDSS pixel data) at positions 
having infrared detections. In fact, it is much simpler to use 
the cross-neighbour functionality, requiring non-detection 
in the optical above a certain limit within a given radius of 
an infrared source. In this way it is possible to make a man- 
ageable candidate list in a single SQL SELECT statement. 
In Ul, it is envisioned that the user develops the query for 
a rare object search by refining the search predicates. The 
starting point would be as follows: 

SELECT COUNT (*) 

FROM lasSource 

WHERE 

/* Colour cuts for mid-T ft later: */ 

ymj_lPnt > 0.5 AND 

j_lmhPnt < 0.0 AND 
/* Source not detected above 2sigma within 
1" in SDSS-DR5 i' or z' : */ 

sourcelD NOT IN ( 

SELECT masterObjID 

FROM lasSourceXDRBPhotoObj AS x, 
BestDRS. .PhotoObj AS p 



^^ |http : //www . sdss ■ jhu.edu/htm] 



WHERE p.objID = x.slaveObjID AND 
(psfMagErr_i < 0.5 OR 

psfMagErr_z < 0.5) AND 

x.distanceMins < 1.0/60.0 
) AND 
/* Use only frame sets overlapping with 
SDSS-DR5 : */ 

frameSetID IN ( 
SELECT DISTINCT (frameSetID) 
FROM lasSource AS s, 

lasSourceXDR5Photo0bj AS x 
WHERE s.sourcelD = x.masterObjID 
) 
/* UKIDSS DR2 rows returned: 1 (count=46, 141) 
Execution time: 16m 52s */ 

which counts 46,141 candidates in the UKIDSS DR2 cross- 
match with SDSS-DR5. The first two predicates simply ap- 
ply a colour cut based on prior knowledge of the objects 
being sought. The third predicate involves a subquery to 
exclude any object that has an optical counterpart above 
the specified limit (2(t) in the SDSS, and a further nested 
subquery to only use the nearest optical cross-match within 
1". Finally, a predicate subquery specifies that only LAS 
frame sets that contain SDSS cross-matches should be used 
in this search, since if any LAS imaging data are outwith the 
area covered by the SDSS, they must be excluded since all 
infrared sources in those regions would be counted as optical 
non-detections. 

Clearly, ~ 4.6 x 10'' candidates is an impractically large 
list for any useful purpose. The predicates need to be ex- 
panded to reduce the list of unwanted and spurious sources 
prior to a more intensive inspection of image thumbnails or 
indeed spectroscopic follow-up on large aperture facilities. 
Addition of the following predicates: 

/* Unduplicated or primary duplicates only: */ 

(priOrSec = OR priOrSec = frameSetID) AND 
/* Generally good quality: */ 

yppErrBits < 256 AND 

j_lppErrBits < 256 AND 

hppErrBits < 256 AND 
/* Point-like morphological classification: */ 

mergedClass=-l AND 

mergedClassStat BETWEEN -3.0 AND +3.0 AND 
/* Reasonably circular images in YJH: */ 

yEll < 0.35 AND 

j_lEll < 0.35 AND 

hEll < 0.35 AND 
/* IR pairs within 0.5 arcsec: */ 

j_lXi BETWEEN -0.5 AND +0.5 AND 

j_lEta BETWEEN -0.5 AND +0.5 AND 

hXi BETWEEN -0.5 AND +0.5 AND 

hEta BETWEEN -0.5 AND +0.5 AND 
/* YJ measured to 5 sigma and H to 4sigma: */ 

yAperMag3Err < 0.20 AND 

j_lAperMag3Err < 0.20 AND 

hAperMag3Err < 0.25 
/* UKIDSS DR2 rows returned: 1 (count=25) 
Execution time: 00m 48s */ 

reduces the number of candidates to 25. The first predicate 
limits the search to unique objects where duplicates exist 
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Figure 18. Arcal coverage of the galaxy catalogue selection described in Section 14.2.61 



in overlap regions; the next set makes cuts on quality er- 
ror flags; the next limits the search to point-like, circular 
sources making generous allowance for noisy, high elliptic- 
ities at low signal-to-noise; the penultimate predicate set 
restricts the selection to YJH pairs within 0.5" boxes (the 
LAS pairing criterion is a generous 2.0" - e.g. Table |3|. All 
the predicates on attributes common to all passbands are 
applied across the relevant filter passbands (YJH) to limit 
the sample selection to those sources appearing in all three. 
Substituting SELECT COUNT (*) with 

SELECT dbo.fIAUnameLAS(ra,dec) , 
yAperMagS , 

ym j _ IPnt , ym j _ IPntErr , 
j _ ImhPnt , j _ ImhPntEr r 

and including ORDER BY ra at the end of the query yields 
the results shown in Table [S] Note the syntax and use of 
the function f lAUnameLASO to automatically output lAU 
standard names for any target. The candidate sample pro- 
duced by the full query includes spectroscopically confirmed 
T dwarfs discussed in Lodieu et al. (2007b) and references 
therein. 



5 CONCLUSION 

We have described the WFCAM Science archive (WSA), 
which is the end point in the data flow of UKIRT WF- 
CAM data in the VISTA Data Flow System, and the pri- 
mary point of access for users of survey science products, 
especially those of the United Kingdom Infrared Deep Sky 
Survey (UKIDSS). In particular, we have described: 

• how the top-level requirements and typical usage modes 
informed the design of the WSA; 

• the arrangement of survey data in terms of a set of 
related tables; 

• the implementation of the archive within a commercial 
relational database management system; 

• the curation procedures employed to create science- 
ready survey catalogues from standard pipeline-processed 
products; 

• example real-world astronomy usage modes along with 
typical results. 

The WSA is the prototype science archive for the 
VISTA surveys, and the design of the VISTA Science 
Archive will follow closely that described here. 
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APPENDIX A: TWENTY USAGES OF THE 
WFCAM SCIENCE ARCHIVE 

Here we reproduce the typical usage modes of the WSA 
that were developed in collaboration with the UKIDSS user 
community during the design phase of the project (more 
details are available onlinqij): 

Ul: Count the number of sources in the LAS which sat- 
isfy the colour constraints (Y-J) > 1.0, (J-H) < 0.5 where 
SDSS iz flux limits at the same position are less than 2a. 
User then refines the query as necessary to give a reasonable 
number of candidates. When satisfied, the user requests a 
list, selecting output attributes from those available for the 
LAS, and finder charts in JHK for each object. 

U2: List all star-like objects with izYJHK 
SDSS/UKIDSS-LAS colours consistent with the colours of 
quasars at redshifts 5.8 < 2 < 7.2 or z > 7.2 (user specifies 
cuts in colour space). Return plots of (i-z) versus (z-J) and 
(i-Y) versus (Y-J) with these sources plotted in a specified 
symbol type, with 1 in every 10,000 other stellar sources 
plotted as points. 

U3: For a given cluster target in the UKIDSS GCS, 
make a candidate membership list via selection of stellar 
sources in colour-magnitude, colour-colour and proper mo- 
tion space. Cross-correlate the candidate list against a user- 
supplied catalogue of optical/near-infrared detections in the 
same region. 

U4: From the UKIDSS LAS, provide a list of aU stel- 
lar objects that have measured proper motions greater than 
5x their estimated proper motion error; additionally give a 
count of all stellar objects that are unpaired between the two 
epochs of the LAS observations with specified conditions on 
image quality fiags. User then refines these conditions to pro- 
duce a manageable list of very high proper motion candidate 
stars. Return finder charts in JHK for all candidates. 



^^ |http : //surveys . roe . ac.uk/wsa/pubs/wsausage ■ html | 



U5: From the UKIDSS DXS & UDS, construct galaxy 
catalogues. User selects all non-stellar sources satisfying 
quality criteria. User also requires the spatial sampling of 
this catalogue. Cross-correlate the galaxy catalogues against 
user-supplied optical catalogue in the same region. 

U6: From the UKIDSS LAS, construct a galaxy cat- 
alogue for all non-stellar sources satisfying K < 18.4 and 
given quality criteria; return full photometric list from SDSS 
& UKIDSS; ugrizYJHK. User also requires the spatial sam- 
pling of this catalogue. 

U7: From the UKIDSS UDS, select a sample of galaxies 
with colours and morphology consistent with being elliptical 
galaxies. Provide a spatial mask to enable determination of 
sample characteristics. Provide a measure of the half-light 
radius for each galaxy. 

U8: From the UKIDSS GPS, provide star counts in 10 
arcmin cells on a grid in Galactic longitude and latitude; 
also provide a list of cells where there is any quality issue 
rendering that cell's value inaccurate. 

U9: From the UKIDSS GPS, provide a list of all sources 
that have brightened by a given amount in the K band. 

UIO: Provide a plot of g-J vs J-K for all point-like 
sources detected in the UKIDSS/LAS survey subject to 
quality constraints. User interacts with the plot to fit a 
straight line (g-J)=a-|-b(J-K) to the main sequence stars. 
Then find aU UKIDSS/LAS sources with g-J>a-Hb(J-K), 
4>g-J>-l, and 3>J-K>-1. 

Ull: Construct H2-K difference image maps for all 
frames within a specified subregion surveyed by the GPS. 

U12: Find all galaxies with a de Vaucouleurs profile and 
infrared colours consistent with being an elliptical galaxy in 
the Virgo region of the UKIDSS LAS. 

U13: Given input co-ordinates and a search radius (ar- 
bitrary system and reference frame) provide a list of all WF- 
CAM observations ever taken that contain data in all or part 
of the specified area. 

U14: Provide a list of point-like sources with multiple 
epoch measurements which have light variations > 0.1 mag- 
nitudes in J, H or K. 

U15: From any UKIDSS data, where multiple epoch 
measures exist for the same object, provide a list of anything 
moving more than X arcsec per hour. 

U16: Provide a list of star-like objects that are 1% rare 
for the 3-colour attributes. 

U17: For a given device in a tile, give me all images 
from the UDS corresponding to that frame, stacked in 10 
day bins. 

U18: Give me a true colour JHK image mosaic using 
frames in the LAS centred at given co-ordinates (arbitrary 
reference frame and system) with 2 degree width and re- 
binned so that the entire mosaic is returned as a 2048x2048 
pixel image. 

U19: Find all detected sources from all UKIDSS sur- 
veys within 3x the error boxes of a user supplied list of X-ray 
transient sources. 

U20: For all sources in a user-supplied radio catalogue 
of HII regions in the GPS, return the Br-gamma surface 
brightness in an aperture of X arcsec 
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APPENDIX B: STRUCTURED QUERY 
LANGUAGE DATA RETRIEVAL 
FUNDAMENTALS 

The basic, general form of a Structured Query Language 
(SQL; Klein & Klein 2001) statement for data retrieval 
(i.e. a query statement) in an RDBMS is as follows: 

SELECT column-l [, coluinn-2, ...] 

FROM table-set-1 [, table-set-2, ...] 

WHERE condition-1 [ AND I OR condition-2 . . . N] (Bl) 

The column definition is generally a comma- 
separated list of attribute names from columns con- 
tained in the table set defined in the FROM clause, 
e.g. SELECT ra, dec, frameSetID ..., but great fiex- 
ibility is available in SQL: expressions involving literal 
constants, mathematical functions and statistical aggre- 
gates are all possible: 



SELECT 'hello world' 
SELECT ra/15.0 AS raHours, ... 
SELECT AVG (CDS (RADIANS (dec))), ... 
SELECT COUNT (DISTINCT multiframelD) , 



(B2) 
(B3) 
(B4) 
(B5) 



are all legal WSA SQL SELECT expressions; example (B2) is 
a complete SQL statement that, insofar as SQL can be con- 
sidered a programming language, demonstrates the classic 
first step in learning the programming syntax - it returns a 
results set consisting of one row having one column having 
the specified string constant value. A more detailed explana- 
tion of SQL SELECT is given onhn^EI at the WSA website in 
the 'SQL Cookbook', while a complete description including 
all standard clauses and non-standard Microsoft SQL Server 
extensions is available elsewherq^. Note that Microsoft SQL 
syntax is not case-sensitive - mixed upper and lower case is 
used in the examples in this paper for clarity only. 

The table set definition in its simplest form consists of 
the name of a single table, e.g. 



SELECT ra, dec, frameSetID 
FROM dxsMergeLog 



(B6) 



returns the equatorial co-ordinates of all frame sets in 
the UKIDSS DXS along with their unique identification 
numbers that have been assigned in the WSA curation 
procedure. Once again, great fiexibility is afi'orded in SQL 
in the table set definition: table-set-N may be any ex- 
pression that defines a tabular dataset, e.g. a table name, a 
view name, or even another SELECT statement. For example, 

SELECT t.* 
FROM ( 

SELECT ra, dec, frameSetID 

FROM dxsMergeLog 
) AS t (B7) 

is an unnecessarily complicated, but nonetheless legal, SQL 



^^ http://surveys.roe.ac.uk/wsa/sqlcookbook.html 

^^ http://msdn2.microsoft.com/en-us/library/msl89826.aspx 



equivalent to statement B6 above (the nested SELECT is com- 
monly known as a subquery in this context; note the use of 
the alias "t" to conveniently label the subquery rowset for 
references elsewhere in the statement). 

By far the most common table set definition that a 
user will need when retrieving data via free-form SQL 
statements is a comma-separated list of related tables. 
With reference to the relational model in Section 12.4.31 
Figure O consider the case where a user wishes to obtain 
a list of the required filters (WSA filter unique identifiers 
filter ID and number of multi-epoch passes in that filter) 
for the UKIDSS programmes, along with generic informa- 
tion on each programme. Since all the relevant information 
is spread between tables Progranraie and RequiredFilters, 
a selection from those two is required: 

SELECT tl.prograiranelD, tl .description, t2.* 

FROM Programme AS tl, RequiredFilters AS t2 (B8) 

This query, however, results in the cartesian product of the 
two tables rather than a union of associated rows. Most of 
the rows in the results set produced by B8 are of course 
meaningless, since all A'^ rows in Programme are joined, one 
by one, with all M rows of in RequiredFilters resulting 
in A'^ X M rows. In order to produce the selection required, 
a WHERE clause must be used to associate the related rows. 



WHERE tl .programmelD = t2 .programmelD 



(B9) 



since any rows where programmelD is difi'erent in the two 
tables are not related. Generally speaking, when querying 
data across A*' tables there should be at least N — 1 WHERE 
clause filters associating related attributes across the tables. 
The attributes to use in filtering are easily determined us- 
ing the WSA schema browser (see Section [3] in the main 
text). They are nearly always indexed primary keys in the 
RDBMS implementation so are highlighted and are at the 
top of each table's attribute list. Moreover a foreign key ref- 
erence is noted at the top of each table definition for every 
many-to-one relationship in the data model; referencing at- 
tributes are generally the ones to filter on in the WHERE clause 
of a join query. Implicit join queries are very common in 
normalised relational database (e.g. Section |4] in the main 
text). The RDBMS design is optimised for the normal form, 
required storage space is minimised, and query performance 
is optimised for speed. 

Otherwise, the WHERE clause is simply a list of condi- 
tional statements linked by logical operators (usually AND). 
These conditions are known as predicates. Comparison 
predicates are common: 



WHERE (ra/15.0 < 12.0 OR dec >=+35.0) 
filterlD <> 3 



AND 



(BIO) 



Other types of predicate are defined in SQL - again, see the 
WSA SQL Cookbook or other online guides to the language. 
Finally, there are some powerful optional clauses 
available to the SELECT statement. An ORDER BY clause can 
be specified, which sorts the results set returned by the 
specified expression. For example. 
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SELECT ra, dec, frameSetID 

FROM dxsMergeLog 

ORDER BY ra ASC (Bll) 

returns the same rows as B6 but in order of increasing, 
i.e. ASCending, RA; specify DESC for descending order. 
Note that without an ORDER BY clause, the order in which 
rows are retrieved from an RDBMS is undefined and, more- 
over, generally unrepeatable - the order can change between 
two consecutive runs of the same query. 

Furthermore, particularly useful for summarising the 
characteristics of data in very large tables is the GROUP BY 
optional clause. This, along with bulit-in aggregate func- 
tions, enables the user to produce summary quantities or 
statistics for large amounts of data arbitrarily grouped 
together on an expression involving one or more column 
names. The GROUP BY clause is best illustrated with a few 
examples. 

SELECT filter ID, COUNT (*) AS totalFrames 

FROM Multiframe 

GROUP BY filterlD (B12) 

is a simple example which returns a count of the number 
of frames in each of the different filters used in observing. 
Note the use of the aggregate function COUNT in B12 above. 
Queries involving GROUP BY will generally use such built-in 
aggregate functions, and this is a particularly powerful 
combination. Other aggregate functions are available 
including minimum/maximum (MIN/MAX), average (AVG), 
summation (SUM) and statistical aggregates, for example 
standard deviation (STDEV). A slightly more complicated 
example is 

SELECT frameSetID, AVG(ra) AS meanRA, 

AVG (dec) AS meanDec, 

COUNT (*) AS numSources 
FROM lasSource 

GROUP BY frameSetID 
HAVING AVG (dec) > 0.0 (B13) 

which returns a list of all northern hemisphere frame sets 
in the UKIDSS LAS, their mean RAs/Decs and a count of 
the number of sources in each. Note the additional HAVING 
clause: just as SELECT may have a WHERE clause to filter rows 
in the tablc(s) specified in the FROM clause, GROUP BY may 
include a HAVING clause to filter rows in the table formed by 
the specified grouping. 



